We consider the problem of training a shallow neural network with quadratic activation functions and the generalization power of such trained networks. Assuming that the samples are generated by a full rank matrix $W^{*}$ of the hidden network node weights, we obtain the following results. We establish that all full-rank approximately stationary solutions of the risk minimization problem are also approximate global optima of the risk (in-sample and population). As a consequence, we establish that, when trained on polynomially many samples, the gradient descent algorithm converges to the global optimum of the risk minimization problem regardless of the width of the network when it is initialized below some risk value $ν^{*}$ , which we compute. Furthermore, the network produced by the gradient descent has a near-zero generalization error. Next, we establish that initializing the gradient descent algorithm below $ν^{*}$ is easily achieved when the weights of the ground truth matrix $W^{*}$ are randomly generated and the matrix is sufficiently overparameterized. Finally, we identify a simple necessary and sufficient geometric condition on the size of the training set under which any global minimizer of the empirical risk has necessarily zero generalization error.

Funding: The research of E. C. Kizildag is supported by Columbia University, with the Distinguished Postdoctoral Fellowship in Statistics. Support from the National Science Foundation [Grant DMS-2015517] is gratefully acknowledged.

1. Introduction

Neural network architectures have been demonstrated to be extremely powerful in practical tasks, such as natural language processing (Collobert and Weston [20]), image recognition (He et al. [46]), image classification (Krizhevsky et al. [53]), speech recognition (Mohamed et al. [63]), and game playing (Silver et al. [80]), and are becoming popular in other areas, such as applied mathematics (Chen et al. [17], Weinan et al. [94]), clinical diagnosis (De Fauw et al. [21]), and so on. Despite this empirical success, a rigorous mathematical understanding of these architectures is still an ongoing quest.

Despite the fact that it is NP-hard to train such architectures in the worst case setting, it has been observed empirically that the gradient descent (GD), albeit a simple, first order, local procedure, is rather successful in training such networks. This is somewhat surprising because of the highly nonconvex nature of the associated objective function. Our main motivation in this paper is to provide further insights into the optimization landscape and generalization abilities of these networks.

1.1. Model, Contributions, and Comparison with Prior Work

In this section, we introduce the model considered in this paper, describe our contributions, and discuss the relevant literature.

1.1.1. Model.

In this paper, we consider a shallow neural network architecture with one hidden layer of width m. Namely, the network consists of m neurons. We study it under the realizable model assumption, that is, the labels are generated by a teacher network with ground truth weight matrix $W^{*} \in R^{m \times d}$ whose $j$th row $W_{j}^{*} \in R^{d}$ carries the weights of the $j$th neuron and $m \geq d$ . We assume that the input data $X \in R^{d}$ consists of independent and identically distributed (i.i.d.) centered sub-Gaussian coordinates. It is worth noting that such shallow architectures with planted weights and Gaussian input data are explored extensively in the literature (see, e.g., Brutzkus and Globerson [11], Du et al. [28], Li and Yuan [56], Soltanolkotabi [82], Tian [89], Zhong et al. [99]).

Our focus is, in particular, on networks with quadratic activation, studied also by Soltanolkotabi et al. [83] and Du and Lee [24], among others. This object, an instance of what is known as a polynomial network (Livni et al. [59]), computes for every input data $X \in R^{d}$ the function

f (W^{*}; X) = \sum_{j = 1}^{m} {〈 W_{j}^{*}, X 〉}^{2} = {‖ W^{*} X ‖}_{2}^{2} .

(1)

We note that, albeit a stylized activation function, blocks of quadratic activations can be stacked together to approximate deeper networks with sigmoid activations as shown by Livni et al. [59], and furthermore, this activation serves as a second order approximation of general nonlinear activations as noted by Venturi et al. [90]. Thus, we study the quadratic networks as an attempt to gain further insights on more complex networks.
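As a quick numerical sanity check of (1), the following minimal Python sketch evaluates the network both neuron by neuron and as a squared Euclidean norm. It assumes NumPy; the names `quadratic_network` and `W_star`, the dimensions, and the Gaussian draws are illustrative choices and do not come from the paper.

```python
import numpy as np

def quadratic_network(W, x):
    """Evaluate f(W; x) = sum_j <W_j, x>^2 for weights W (m x d) and input x (d,)."""
    return float(np.sum((W @ x) ** 2))

rng = np.random.default_rng(0)
m, d = 6, 4                            # illustrative sizes with m >= d
W_star = rng.standard_normal((m, d))   # hypothetical planted weights
x = rng.standard_normal(d)             # one input vector

# Left-hand side of (1): sum over neurons of squared inner products.
neuron_sum = sum(np.dot(W_star[j], x) ** 2 for j in range(m))
# Right-hand side of (1): squared Euclidean norm of W* x.
squared_norm = np.linalg.norm(W_star @ x) ** 2
```

The two quantities agree up to floating-point rounding, which is all that (1) asserts.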

Let $X_{i} \in R^{d}, 1 \leq i \leq N$ be an i.i.d. collection of input data, and let $Y_{i} = f (W^{*}; X_{i})$ be the corresponding label generated per (1). The goal of the learner is as follows: given the training data $(X_{i}, Y_{i}) \in R^{d} \times R, 1 \leq i \leq N$ , find a weight matrix $W \in R^{m \times d}$ that explains the input–output relationship on the training data set in the best possible way, often by solving the so-called empirical risk minimization (ERM) optimization problem

\min_{W \in R^{m \times d}} \hat{L} (W) where \hat{L} (W) ≜ \frac{1}{N} \sum_{1 \leq i \leq N} (Y_{i} - f (W; X_{i}))^{2};

(2)

and understand its generalization ability, quantified by the generalization error (also known as the population risk) associated with any solution candidate $W \in R^{m \times d}$ , which is given by

L (W) ≜ E [{(f (W^{*}; X) - f (W; X))}^{2}],

(3)

where the expectation is with respect to a fresh sample X, which has the same distribution as $X_{i}, 1 \leq i \leq N$ , but is independent from the sample. The landscape of the loss function $\hat{L} (\cdot)$ is nonconvex, therefore rendering the optimization problem (potentially) difficult. Nevertheless, the gradient descent algorithm, despite being a simple first order procedure, is rather successful in training neural networks in general: it appears to find a $W \in R^{m \times d}$ with near-optimal $\hat{L} (W)$ . Our partial motivation is to investigate this phenomenon in the case in which the activation function is quadratic.
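To make the two risks concrete, here is a minimal Python sketch of the empirical risk (2) and a Monte Carlo estimate of the population risk (3). It assumes NumPy and standard Gaussian inputs (one admissible sub-Gaussian choice); the function names, sample sizes, and the perturbation level are illustrative and not taken from the paper.

```python
import numpy as np

def f(W, X):
    """Quadratic network outputs ||W x||_2^2 for each row x of X (N x d)."""
    return np.sum((X @ W.T) ** 2, axis=1)

def empirical_risk(W, X, Y):
    """Empirical risk (2): mean squared residual over the training sample."""
    return float(np.mean((Y - f(W, X)) ** 2))

def population_risk_mc(W, W_star, sampler, n_fresh=100_000):
    """Monte Carlo estimate of the population risk (3) from fresh samples."""
    X_fresh = sampler(n_fresh)
    return float(np.mean((f(W_star, X_fresh) - f(W, X_fresh)) ** 2))

rng = np.random.default_rng(1)
m, d, N = 6, 4, 50
W_star = rng.standard_normal((m, d))              # hypothetical planted weights
sampler = lambda n: rng.standard_normal((n, d))   # assumed data distribution
X = sampler(N)
Y = f(W_star, X)                                  # labels generated per (1)

W_perturbed = W_star + 0.1 * rng.standard_normal((m, d))
risk_at_truth = empirical_risk(W_star, X, Y)       # zero by construction
risk_perturbed = empirical_risk(W_perturbed, X, Y)
gen_error_perturbed = population_risk_mc(W_perturbed, W_star, sampler)
```

The planted weights attain zero empirical risk by construction, whereas a generic perturbation has strictly positive empirical and (estimated) population risk.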

1.1.2. Contributions.

We first study the landscape of risk functions and quantify an energy barrier separating rank-deficient matrices from the full-rank planted weights. Specifically, if $W^{*} \in R^{m \times d}$ is full-rank, namely, rank d (recall $m \geq d$ ), then the risk function for any rank-deficient W is bounded away from zero by an explicit constant—independent of d—controlled by the smallest singular value $σ_{min} (W^{*})$ of $W^{*}$ as well as the second and fourth moments of the data. See Theorem 1 for the population and Theorem 2 for the empirical versions of this result. (Theorem 2 holds with high probability (w.h.p.) with respect to the observed sample.)

Next, we study the full-rank stationary points of the risk functions and the performance of the gradient descent algorithm. We first establish that, when $W^{*}$ is full rank, any full-rank stationary point W of the risk functions is necessarily a global minimum and any such W is of the form $W = Q W^{*}$ , where $Q \in R^{m \times m}$ is orthonormal. Namely, W is a global optimum up to a rotation. See Theorem 3 for the population and Theorem 4 for the empirical versions. We then establish that all approximate stationary points (appropriately defined) W of $\hat{L} (\cdot)$ below the aforementioned energy barrier are nearly globally optimal. Furthermore, we establish that, if the number N of samples is $poly (d)$ , then the weights W of any full-rank approximate stationary point are uniformly close to $W^{*}$ . As a corollary, gradient descent with initialization below the energy barrier recovers in time $poly (ϵ^{- 1}, d)$ a solution W for which the weights are ϵ-close to the planted weights. Consequently, the generalization error $L (W)$ for this solution W is at most ϵ. The bound on $L (W)$ is derived by controlling the condition number of a certain matrix whose i.i.d. rows consist of tensorized data $X_{i}^{\otimes 2}$ , using a recently developed machinery in Emschwiller et al. [33] studying the spectrum of expected covariance matrices of tensorized data. See Theorem 5 for the population and Theorem 6 for the empirical version.
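The easy direction of the rotation statement can be checked numerically: since $f (Q W^{*}; X) = {‖ Q W^{*} X ‖}_{2}^{2} = {‖ W^{*} X ‖}_{2}^{2}$ for orthonormal Q, every $W = Q W^{*}$ has zero empirical risk and zero gradient. The Python sketch below assumes NumPy and Gaussian data; the gradient formula $\nabla \hat{L} (W) = \frac{4}{N} W \sum_{i} r_{i} X_{i} X_{i}^{T}$ with $r_{i} = f (W; X_{i}) - Y_{i}$ is our own direct differentiation of (2), validated here by a finite-difference comparison, and all names are illustrative.

```python
import numpy as np

def f(W, X):
    """Quadratic network outputs ||W x||_2^2 for each row x of X."""
    return np.sum((X @ W.T) ** 2, axis=1)

def empirical_risk(W, X, Y):
    return float(np.mean((f(W, X) - Y) ** 2))

def empirical_risk_grad(W, X, Y):
    """Gradient of (2): (4/N) * W * sum_i r_i x_i x_i^T with r_i = f(W; x_i) - y_i."""
    r = f(W, X) - Y
    return (4.0 / len(Y)) * W @ ((X.T * r) @ X)

rng = np.random.default_rng(2)
m, d, N = 5, 3, 30
W_star = rng.standard_normal((m, d))   # hypothetical planted weights
X = rng.standard_normal((N, d))        # assumed Gaussian data
Y = f(W_star, X)

# A random orthonormal Q from a QR factorization; W_rot = Q W* is a "rotated" solution.
Q, _ = np.linalg.qr(rng.standard_normal((m, m)))
W_rot = Q @ W_star
risk_rot = empirical_risk(W_rot, X, Y)
grad_norm_rot = float(np.linalg.norm(empirical_risk_grad(W_rot, X, Y)))

# Central finite-difference check of the gradient at a generic point W along direction D.
W = rng.standard_normal((m, d))
D = rng.standard_normal((m, d))
h = 1e-5
fd = (empirical_risk(W + h * D, X, Y) - empirical_risk(W - h * D, X, Y)) / (2 * h)
analytic = float(np.sum(empirical_risk_grad(W, X, Y) * D))
```

This only illustrates that rotated copies of $W^{*}$ are stationary global minima; the converse, that every full-rank stationary point has this form, is the content of the theorems and is not something a numerical experiment can establish.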

Subsequently, we study the question of whether one can find the initialization of the gradient descent algorithm below the aforementioned energy barrier. We answer this question affirmatively in the context of randomly generated $W^{*} \in R^{m \times d}$ and establish in Theorem 8 that, as long as the network is sufficiently overparameterized, specifically $m > C d^{2}$ , for some sufficiently large constant C, it is possible to initialize W₀ such that w.h.p. the risk associated with W₀ is below the required threshold. This is achieved by using tools from random matrix theory, specifically a semicircle law for Wishart matrices that shows the spectrum of ${(W^{*})}^{T} W^{*}$ is tightly concentrated (Bai and Yin [3]). See Theorem 7 for the population and Theorem 8 for the empirical version. It is worth noting that neural networks with random weights are an active area of research in their own right because of the relationship with random feature methods. For example, Rahimi and Recht [73] show that shallow architectures trained by choosing the internal weights randomly and optimizing only over the output weights return a classifier with reasonable generalization performance at accelerated training speed. Random shallow networks are also shown to well-approximate dynamical systems (Gonon et al. [40]), have been successfully employed in the context of extreme learning machines (Huang et al. [48]), and are studied in the context of random matrix theory; see Pennington and Worah [69] and references therein.
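The role of the spectral concentration can be illustrated with a small experiment. The Python sketch below assumes NumPy, i.i.d. standard normal entries for $W^{*}$ , and standard Gaussian data, for which $E [{(X^{T} M X)}^{2}] = trace (M)^{2} + 2 {‖ M ‖}_{F}^{2}$ for symmetric M. The candidate initialization with $W_{0}^{T} W_{0} = m I_{d}$ is an illustrative choice of ours and is not claimed to be the initialization used in Theorems 7 and 8; the barrier value uses the lower bound of Theorem 1 with $μ_{2} = 1, μ_{4} = 3$ .

```python
import numpy as np

rng = np.random.default_rng(3)
d = 5
m = 2000                                 # illustrative width, much larger than d^2
W_star = rng.standard_normal((m, d))     # assumed i.i.d. standard normal planted weights
A_star = W_star.T @ W_star

eigvals = np.linalg.eigvalsh(A_star)     # ascending eigenvalues of (W*)^T W*
scaled_eigs = eigvals / m                # concentrate near 1 when m >> d
sigma_min = np.sqrt(eigvals[0])          # smallest singular value of W*

# Illustrative candidate initialization with W0^T W0 = m * I_d (hypothetical choice).
W0 = np.zeros((m, d))
W0[:d, :] = np.sqrt(m) * np.eye(d)

# Population risk of W0 in closed form, valid for standard Gaussian data.
M = A_star - W0.T @ W0
risk_W0 = float(np.trace(M) ** 2 + 2 * np.linalg.norm(M, "fro") ** 2)

# Energy barrier from Theorem 1: min{mu4 - mu2^2, 2 mu2^2} * sigma_min^4 = 2 * sigma_min^4.
barrier = float(2 * sigma_min ** 4)
```

Heuristically, `risk_W0` grows like $m d^{2}$ while `barrier` grows like $m^{2}$ , which is consistent with the requirement $m > C d^{2}$ ; this back-of-the-envelope scaling is offered as intuition only.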

Our next focus is on the sample complexity for generalization. Even though we study the landscape of the empirical risk, it is not by any means certain that any (potentially not full-rank) optimizer of $\min_{W} \hat{L} (W)$ also achieves zero generalization error. We give necessary and sufficient conditions on the samples $X_{i}, 1 \leq i \leq N$ so that any minimizer has indeed zero generalization error in our setting. We show that, if $span (X_{i} X_{i}^{T} : 1 \leq i \leq N)$ is the space of all d × d-dimensional real symmetric matrices, then any global minimum of the empirical risk is necessarily a global optimizer of the population risk and, thus, has zero generalization error. Note that this geometric condition is not retrospective in manner: it can be checked ahead of the optimization procedure by computing $span (X_{i} X_{i}^{T} : 1 \leq i \leq N)$ . Conversely, we show that, if the preceding span condition is not met, then there exists a global minimum W of the empirical risk function that induces a strictly positive generalization error. This is established in Theorem 9.
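Because a symmetric matrix is determined by its upper-triangular entries, the span condition reduces to a rank computation that can be run before any training. The following Python sketch assumes NumPy; the helper name `spans_symmetric_matrices` and the sample sizes are illustrative, and the rank is computed numerically with NumPy's default tolerance.

```python
import numpy as np

def spans_symmetric_matrices(X):
    """Check whether {x_i x_i^T : i} spans all d x d real symmetric matrices.

    X has shape (N, d). Each x_i x_i^T is mapped to its upper-triangular
    entries; the span is the full space iff the resulting N x d(d+1)/2
    matrix has rank d(d+1)/2.
    """
    N, d = X.shape
    iu = np.triu_indices(d)
    vech = np.stack([np.outer(x, x)[iu] for x in X])
    return bool(np.linalg.matrix_rank(vech) == d * (d + 1) // 2)

rng = np.random.default_rng(4)
d = 4                                    # space of symmetric matrices has dimension 10
few = rng.standard_normal((5, d))        # too few samples to span
many = rng.standard_normal((40, d))      # comfortably more than 10 samples

few_ok = spans_symmetric_matrices(few)
many_ok = spans_symmetric_matrices(many)
```

With five samples the rank is at most five, so the condition necessarily fails; with forty Gaussian samples it holds in this run.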

To complement our analysis, we then ask the following question: what is the critical number $N^{*}$ of the training samples under which the (random) data $X_{i}, 1 \leq i \leq N$ enjoys the aforementioned span condition? We prove this number to be $N^{*} = d (d + 1) / 2$ under a very mild assumption that the coordinates of $X_{i} \in R^{d}$ are jointly continuous. This is shown in Theorem 10. Finally in Theorem 11, we show that, when $N < N^{*}$ , not only does there exist W with zero empirical risk and strictly positive generalization error, we also bound this error from below by an amount very similar to the bound for rank-deficient matrices discussed in our earlier Theorem 2.
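The failure mode for $N < N^{*}$ can be made explicit. The Python sketch below (assuming NumPy, standard Gaussian data, and a Gaussian $W^{*}$ ; all names are illustrative) finds a symmetric S with $X_{i}^{T} S X_{i} = 0$ for every sample, perturbs ${(W^{*})}^{T} W^{*}$ along S while keeping it positive definite, and factors the result to obtain an interpolating W. This construction is ours, intended only to illustrate the phenomenon; it is not claimed to be the one used in the proofs of Theorems 9 and 11.

```python
import numpy as np

rng = np.random.default_rng(5)
d, m = 4, 10
D_sym = d * (d + 1) // 2                # N* = d(d+1)/2 = 10
N = D_sym - 1                           # one sample short of N*

W_star = rng.standard_normal((m, d))    # hypothetical planted weights
A_star = W_star.T @ W_star
X = rng.standard_normal((N, d))         # assumed standard Gaussian data
Y = np.sum((X @ W_star.T) ** 2, axis=1)

# Row i of Phi represents the linear map S -> x_i^T S x_i in upper-triangular
# coordinates of a symmetric S (off-diagonal entries are counted twice).
iu = np.triu_indices(d)
weights = np.where(iu[0] == iu[1], 1.0, 2.0)
Phi = np.stack([np.outer(x, x)[iu] * weights for x in X])

# Since N < D_sym, Phi has a nontrivial null space; the last right singular
# vector of the full SVD lies in it.
null_vec = np.linalg.svd(Phi)[2][-1]
S = np.zeros((d, d))
S[iu] = null_vec
S = S + S.T - np.diag(np.diag(S))       # symmetric S with x_i^T S x_i = 0 for all i

# Perturb A* along S with a step small enough to stay positive definite.
t = 0.5 * np.linalg.eigvalsh(A_star)[0] / np.linalg.norm(S, 2)
A = A_star + t * S
W = np.zeros((m, d))
W[:d, :] = np.linalg.cholesky(A).T      # so that W^T W = A

empirical_risk = float(np.mean((Y - np.sum((X @ W.T) ** 2, axis=1)) ** 2))
# Closed form for standard Gaussian data: E[(x^T M x)^2] = trace(M)^2 + 2 ||M||_F^2.
M = A - A_star
generalization_error = float(np.trace(M) ** 2 + 2 * np.linalg.norm(M, "fro") ** 2)
```

The resulting W fits every training label up to rounding yet has strictly positive generalization error, because $W^{T} W$ differs from ${(W^{*})}^{T} W^{*}$ in a direction the samples cannot detect.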

We end with a comment on overparameterization and generalization. A common paradigm in statistical learning theory is that overparameterized models, that is, models with more parameters than necessary, although capable of interpolating the training data, tend to generalize poorly because of overfitting to the proposed model. Yet it is observed empirically that neural networks tend not to suffer from this complication (Zhang et al. [96]): despite being overparameterized, they seem to have a good generalization performance provided the interpolation barrier is exceeded. In Theorem 9(a), we establish the following result, which sheds some light on this phenomenon for the case of shallow neural networks with quadratic activations: suppose that the data enjoys the aforementioned geometric condition. Then, any interpolator achieves zero generalization error even when the interpolator is a neural network with a potentially larger number $\hat{m}$ of internal nodes compared with the one that generated the data, namely, by using a weight matrix $W \in R^{\hat{m} \times d}$ , where $\hat{m} \geq m$ . In other words, the model does not overfit when a much larger width of the interpolator is chosen at the learning stage.

1.1.3. Comparison with Soltanolkotabi et al. [83] and Du and Lee [24].

We now make a comparison with two very related prior works, also studying quadratic activations. We start with the work by Soltanolkotabi et al. [83]. In Soltanolkotabi et al. [83, theorem 2.2], the authors study the empirical risk landscape of a slightly more general version of our model: $Y_{i} = \sum_{j = 1}^{m} v_{j}^{*} {〈 W_{j}^{*}, X_{i} 〉}^{2}$ , assuming $rank (W^{*}) = d$ like us and assuming all nonzero entries of $v^{*}$ have the same sign. Thus, our model is the special case in which all entries of $v^{*}$ are equal to unity. The authors establish that, as long as $d \leq N \leq c d^{2}$ for some small fixed constant c, every local minimum of the empirical risk function is also a global minimum (namely, there exist no spurious local minima), and furthermore, every saddle point has a direction of negative curvature. As a result, they show that gradient descent with an arbitrary initialization converges to a globally optimal solution of the ERM problem (2). In particular, their result does not require the initialization point to be below some risk value (the energy barrier) as in our case. Nevertheless, our results show that one need not worry about saddle points below the energy barrier as none exists per our Theorem 2. Importantly, though, the regime $N < c d^{2}$ for small c to which Soltanolkotabi et al. [83, theorem 2.2] applies is below the provable sample complexity value $N^{*} = d (d + 1) / 2$ when the data are drawn from a continuous distribution as per our Theorem 10. In particular, as we establish, when $N < N^{*}$ , the ERM problem (2) admits global optimum solutions with zero empirical risk value but with the generalization error bounded away from zero. Thus, the regime $N < N^{*}$ does not correspond to the regime in which solving the ERM has a guaranteed control on the generalization error. The same theorem in Soltanolkotabi et al. [83] also studies the approximate stationary points and shows that, for any such point W, the associated empirical risk, $\hat{L} (W)$ , is also small. Our Theorem 6, though, takes a step further and shows that not only is the empirical risk small, but the recovered W is close to planted weights $W^{*}$ , and therefore, it has a small generalization error $L (W)$ by explicitly bounding the generalization error from above.

It is also worth noting that, although not our focus in the present paper, Soltanolkotabi et al. [83, theorem 2.1] also studies the landscape of the empirical risk when a quadratic network model $X \mapsto \sum_{j = 1}^{m} v_{j}^{*} {〈 W_{j}^{*}, X 〉}^{2}$ is used for interpolating arbitrary input/label pairs $(X_{i}, Y_{i}) \in R^{d} \times R, 1 \leq i \leq N$ , that is, without making an assumption that the labels are generated according to a network with planted weights. They establish similar landscape results, namely, the absence of spurious local minima and the fact that every saddle point has a direction of negative curvature as long as the output weight $v^{*}$ has at least d positive and at least d negative entries (consequently, the width m has to be at least 2d). Even though this result does not assume any rank condition on W, the assumption on the minimum number of positive and minimum number of negative entries, such as the preceding one, is somewhat unnatural.

Yet another closely related work studying quadratic activations is the paper by Du and Lee [24], which focuses on the shallow architectures with all unity output weights as we do. This paper establishes that, for any smooth and convex loss $ℓ (\cdot, \cdot)$ , the landscape of the regularized loss function $\frac{1}{N} \sum_{i = 1}^{N} ℓ (f (W; X_{i}), Y_{i}) + \frac{λ}{2} {‖ W ‖}_{F}^{2}$ still admits the aforementioned favorable geometric characteristics. Furthermore, as the learned weights are of bounded Frobenius norm thanks to the norm penalty ${‖ W ‖}_{F}^{2}$ imposed on the objective, they retain good generalization via Rademacher complexity considerations. Even though this work addresses the training and generalization error when the norm of W is controlled during training, it does not carry out an approximate stationarity analysis, as Soltanolkotabi et al. [83] and we do, and it does not study the loss/generalization associated with approximate stationary points as in our case. Even though they show that optimal solutions to the optimization problem incorporating bounded norms generalize well, it remains unclear from their analysis whether the approximate stationary points of this objective also have a well-controlled norm.

It is worth mentioning that the two main directions that we undertake in this paper were not explored in either of these two prior works. These include the direction pertaining to the initialization (Theorems 7 and 8) and the direction pertaining to the sample complexity (Theorems 9–11). The latter direction relates to an interesting interpolation/overparameterization property that we have discussed before. We return later to this direction in Section 2.3 after we present Theorem 9.

1.1.4. Connection to Matrix Sensing.

The problem of learning shallow quadratic networks with planted weights $W^{*}$ is closely related to the matrix sensing problem. This problem has many applications, spanning from video and image processing to control and sensor networks (see Deng et al. [23], Qin et al. [72], Zhong et al. [97] and references therein); we now elaborate on the connection between our setting and the matrix sensing problem. Given an unknown matrix $A^{*} \in R^{d_{1} \times d_{2}}$ , the goal of the matrix sensing problem is to recover $A^{*}$ from a (small) number of linear measurements of form $〈 Z_{i}, A^{*} 〉, 1 \leq i \leq N$ , where $Z_{i} \in R^{d_{1} \times d_{2}}, 1 \leq i \leq N$ , and $〈 Z_{i}, A^{*} 〉 = trace (Z_{i}^{T} A^{*})$ . Recalling (1), our data model corresponds to

Y_{i} = {‖ W^{*} X_{i} ‖}_{2}^{2} = X_{i}^{T} {(W^{*})}^{T} W^{*} X_{i} = 〈 {(W^{*})}^{T} W^{*}, X_{i} X_{i}^{T} 〉 .

Namely, the problem that we study in the present paper is an instance of the matrix sensing problem; the goal is to recover $A^{*} = {(W^{*})}^{T} W^{*}$ from rank-1 measurements of the form $X_{i} X_{i}^{T}$ , where $X_{i} \in R^{d}$ have centered i.i.d. sub-Gaussian coordinates. A related prior work by Zhong et al. [97] studies the problem of sensing a low-rank matrix from rank-1 measurements of the form $X_{i} Y_{i}^{T}$ , where $X_{i}, Y_{i} \in R^{d}$ are i.i.d. and consist of centered i.i.d. sub-Gaussian coordinates (see also Deng et al. [23] for an improvement of Zhong et al. [97] in terms of the time and sample complexity). Notice that Zhong et al. [97] study a slightly different measurement matrix of the form $X_{i} Y_{i}^{T}$ (as opposed to $X_{i} X_{i}^{T}$ ). More importantly, Zhong et al. [97] are concerned with recovering a low-rank $A^{*}$ , whereas the fact that ${(W^{*})}^{T} W^{*}$ is full-rank is crucial for our techniques; see subsequent details. A very recent work by Qin et al. [72] relaxes the low-rank assumption, studies measurement matrices of the form $u_{i} u_{i}^{T}$ for $u_{i} \in R^{d}$ that are nearly orthogonal, and gives recovery guarantees for the stochastic gradient descent. Note that, for $X_{i} \in R^{d}$ as before, standard concentration arguments show that the collection $X_{i}, 1 \leq i \leq N$ , consists (w.h.p.) of nearly orthogonal vectors. Therefore, the setting studied in Qin et al. [72] is very similar to ours, and their sample complexity guarantee regarding the gradient descent appears to have a better dependence on dimension than ours. On the other hand, Qin et al. [72] focus only on the problem of finding an $\hat{A}$ such that

\sum_{1 \leq i \leq N} {(〈 \hat{A}, X_{i} X_{i}^{T} 〉 - 〈 A^{*}, X_{i} X_{i}^{T} 〉)}^{2}

is small; it is not clear from their analysis whether $\hat{A}$ is also close to ground truth $A^{*}$ , that is, whether $‖ \hat{A} - A^{*} ‖$ is small. This is one of the main directions we explore in the present paper. For more on the matrix sensing problem, see Recht et al. [74], Zhong et al. [97], Deng et al. [23], Qin et al. [72], and the references therein.
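The identity $Y_{i} = 〈 {(W^{*})}^{T} W^{*}, X_{i} X_{i}^{T} 〉$ behind this connection is easy to verify numerically. The Python sketch below assumes NumPy and Gaussian data; it also recovers $A^{*}$ by plain least squares over the upper-triangular coordinates of a symmetric matrix, which is a generic baseline chosen for illustration and not the method analyzed in this paper or in any of the cited works.

```python
import numpy as np

rng = np.random.default_rng(6)
m, d, N = 6, 4, 30                      # N exceeds d(d+1)/2 = 10
W_star = rng.standard_normal((m, d))    # hypothetical planted weights
X = rng.standard_normal((N, d))         # assumed Gaussian data

A_star = W_star.T @ W_star              # the d x d matrix to be sensed
labels_network = np.sum((X @ W_star.T) ** 2, axis=1)                         # ||W* x_i||^2
labels_sensing = np.array([np.trace(A_star.T @ np.outer(x, x)) for x in X])  # <A*, x_i x_i^T>

# Least-squares recovery of a symmetric A from the linear measurements.
iu = np.triu_indices(d)
weights = np.where(iu[0] == iu[1], 1.0, 2.0)   # off-diagonal entries appear twice
Phi = np.stack([np.outer(x, x)[iu] * weights for x in X])
vech_hat, *_ = np.linalg.lstsq(Phi, labels_network, rcond=None)

A_hat = np.zeros((d, d))
A_hat[iu] = vech_hat
A_hat = A_hat + A_hat.T - np.diag(np.diag(A_hat))   # symmetrize
```

With more samples than the dimension of the space of symmetric matrices, the least-squares solution coincides with $A^{*}$ up to rounding; note that this recovers ${(W^{*})}^{T} W^{*}$ rather than $W^{*}$ itself, which is identifiable only up to an orthonormal factor.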

1.1.5. Further Relevant Prior Work.

As noted in the introduction, neural networks achieved remarkable empirical success, which fueled research starting from the expressive ability of these networks, going as early as Barron [5]. More recent works along this front focus on deeper and sparser models; see, for example, Mhaskar et al. [62], Telgarsky [88], Eldan and Shamir [31], Schmidt-Hieber [78], Poggio et al. [70], and Bölcskei et al. [10]. In particular, the expressive power of such network architectures is relatively well understood. Another issue pertaining to such architectures is their computational tractability: Blum and Rivest [9] establish that it is NP-complete to train a very simple three-node network whose nodes compute a linear thresholding function. Despite this worst case result, it is observed empirically that local search algorithms, such as GD, are rather successful in training. Even though several authors, including Sedghi and Anandkumar [79], Janzamin et al. [49], and Goel et al. [38], devise provable training algorithms for such networks, these algorithms unfortunately are based on methods other than the gradient descent, thus not shedding any light on its apparent empirical success.

On a parallel front, many papers study the behavior of the GD by analyzing the trajectory of it or its stochastic variant (SGD) under certain stylistic assumptions on the data as well as the network. These assumptions include Gaussian inputs, shallow networks (with or without the convolutional structure), and the existence of planted weights (the so-called teacher network) generating the labels. Some partial and certainly very incomplete references to this end include Tian [89], Brutzkus and Globerson [11], Brutzkus et al. [12], Zhong et al. [98], Soltanolkotabi [82], Li and Yuan [56], and Du et al. [28]. Later work relaxes the distributional assumptions. For instance, Du et al. [25] study the problem of learning a convolutional unit with ReLU with no specific distributional assumption on input and establish the convergence of SGD with rate depending on the smoothness of the input distribution and the closeness of the patches. Several other works along this line, in particular under the presence of overparameterization, are the works by Du et al. [26, 27].

Yet another line of research on the optimization front, rather than analyzing the trajectory of the GD, focuses on the mean-field analysis. The empirical distribution of the parameters of a network with infinitely many internal nodes can be described as a Wasserstein gradient flow, and thus, some tools from the theory of optimal transport can be used; see, for example, Wei et al. [93], Rotskoff and Vanden-Eijnden [75], Chizat and Bach [18], Song et al. [84], and Sirignano and Spiliopoulos [81]. Although these techniques explain the story to some extent for infinitely wide networks, it remains unclear whether they provide results for a more realistic network model with finitely many internal nodes.

As noted earlier, the optimization landscape of such networks is usually highly nonconvex. More recent research on such nonconvex objectives shows that, if the landscape has certain favorable geometric properties, such as the absence of spurious local minima and the existence of a direction of negative curvature for every saddle point, local methods can escape the saddle points and converge to the global minima. Examples of this line of research on loss functions include Ge et al. [37], Levy [55], Lee et al. [54], Jin et al. [50], and Du et al. [29]. Motivated by this front of research, many papers analyze geometric properties of the optimization landscape, including Poston et al. [71], Haeffele et al. [43], Choromanska et al. [19], Haeffele and Vidal [42], Kawaguchi [51], Hardt and Ma [44], Soudry and Carmon [85], Freeman and Bruna [34], Zhou and Feng [100], Nguyen and Hein [67, 68], Ge et al. [36], Safran and Shamir [77], Soudry and Hoffer [86], Zhou and Liang [101], Venturi et al. [90], Du and Lee [24], and Soltanolkotabi et al. [83].

We now touch upon yet another very important focus, that is, the generalization ability of such networks: how well does a solution found, for example, by GD, predict unseen data? A common paradigm in statistical learning theory that was mentioned previously is that overparameterized models tend to generalize poorly. Yet neural networks tend not to suffer from this complication (Zhang et al. [96]). Because the Vapnik–Chervonenkis dimension of these networks grows (at least) linearly in the number of parameters (Bartlett et al. [7], Harvey et al. [45]), standard Vapnik–Chervonenkis theory does not help explain the good generalization ability in the presence of overparameterization. This is studied, among others, through the lens of the norm of weight matrices (Bartlett et al. [6], Dziugaite and Roy [30], Golowich et al. [39], Liang et al. [58], Neyshabur et al. [65], Wu et al. [95]), PAC-Bayes theory (Neyshabur et al. [64, 66]), and compression-based bounds (Arora et al. [1]). A main drawback is that these papers require some sort of constraints on the weights and are mostly a posteriori: whether a good generalization takes place can be determined only when the training process is finished. A recent work by Arora et al. [2] provides an a priori guarantee for the solution found by the GD. Our result regarding the generalization guarantee described in Theorem 11 also provides a simple a priori guarantee on the generalization.

1.1.6. A Follow-up Work.

After our paper appeared on arXiv, a follow-up work was done by Mannelli et al. [60]. In this paper, the authors consider the same architecture (namely, a shallow network with quadratic activations) under the so-called teacher/student setting and study the landscape of the empirical risk as well as the performance of the gradient flow, the continuous-time analogue of the gradient descent. Importantly, they consider also the regime in which the number $m^{*}$ of the hidden units of the teacher network is less than the dimension d (whereas our focus is on $m^{*} \geq d$ ). In particular, when $m^{*} = 1$ , the width m of the student network is at least d, and the data consists of i.i.d. standard normal entries, they prove the following. In the limit as $d \to \infty$ , if $n > 2 d$ , then with positive probability the only minimizer of the empirical risk is the matrix $A^{*}$ of teacher weights itself, whereas for $n < 2 d$ , the empirical risk admits spurious minima with probability tending to one. (Namely, the geometry of the empirical risk undergoes a phase transition as $α ≜ n / d$ crosses $α_{c} = 2$ .) Moreover, they also prove that, for $m \geq d$ , the gradient flow converges to a global minimum of the empirical risk and to the global minimum of the population risk (which is $A^{*}$ ) and characterize the rate of convergence for the latter case. (It is worth noting that running gradient flow on the population risk can be perceived as running it on the empirical risk in the limit of the large number n of samples.)

1.1.7. Paper Organization.

In Section 2.1, we present our main results on the landscape of the risk functions, including our energy barrier result for rank-deficient matrices, our result about the absence of full-rank stationary points of the risk function except the globally optimum points, and our result on the convergence of gradient descent. In Section 2.2, we present our results regarding randomly generated weight matrices $W^{*}$ and sufficient conditions for good initializations. In Section 2.3, we study the critical number of training samples guaranteeing good generalization property. We collect useful auxiliary lemmas in Section 3 and provide the proofs of all of our results in Section 4.

1.1.8. Notation.

The set of reals, positive reals, and the set $\{1, 2, \dots, k\}$ are denoted, respectively, by $R, R_{+}$ , and $[k]$ . For any matrix A, its smallest and largest singular values, spectrum, trace, Frobenius, and spectral norm are denoted, respectively, by $σ_{min} (A), σ_{max} (A), σ (A), trace (A), {‖ A ‖}_{F}$ , and ${‖ A ‖}_{2}$ . We denote the n × n identity matrix by $I_{n}$ . Planted weights are denoted with an asterisk, for example, $W^{*}$ . $\exp (α)$ denotes $e^{α}$ . Given any $v \in R^{n}, {‖ v ‖}_{2}$ denotes its Euclidean $ℓ_{2}$ norm $\sqrt{\sum_{1 \leq i \leq n} v_{i}^{2}}$ . Given two vectors $x, y \in R^{n}$ , their Euclidean inner product $\sum_{1 \leq i \leq n} x_{i} y_{i}$ is denoted by $〈 x, y 〉$ . Given a collection $Z_{1}, \dots, Z_{k}$ of objects of the same kind (e.g., vectors or matrices), $span (Z_{i} : i \in [k])$ is the set $\{\sum_{j = 1}^{k} α_{j} Z_{j} : α_{j} \in R\}$ . We say a random variable X is centered if $E [X] = 0$ . $Θ (\cdot), O (\cdot), o (\cdot)$ , and $Ω (\cdot)$ are standard (asymptotic) order notations for comparing the growth of two sequences. $\hat{L} (\cdot), \nabla \hat{L} (\cdot), L$ , and $\nabla L$ denote, respectively, the empirical risk, its gradient, and the population risk and its gradient.

2. Main Results

Our main results are now in order.

2.1. Optimization Landscape

2.1.1. Existence of an Energy Barrier.

Our first result shows the presence of an energy barrier in the landscape of the population risk $L (\cdot)$ below which any rank-deficient $W \in R^{m \times d}$ ceases to exist.

Theorem 1.

Suppose that $X \in R^{d}$ has i.i.d. centered coordinates with variance μ₂, (finite) fourth moment μ₄, $rank (W^{*}) = d$ , and let $L (W)$ be defined as (3).

(a) (Lower bound) It holds that
$\min_{W \in R^{m \times d} : rank (W) < d} L (W) \geq \min \{μ_{4} - μ_{2}^{2}, 2 μ_{2}^{2}\} \cdot σ_{min} {(W^{*})}^{4} .$
(b) (Tightness) There exists a matrix $W \in R^{m \times d}$ such that $rank (W) \leq d - 1$ and
$L (W) \leq \max \{μ_{4}, 3 μ_{2}^{2}\} \cdot σ_{min} {(W^{*})}^{4} .$

The proof of Theorem 1 is deferred to Section 4.2. Two remarks are in order.

First, the hypothesis of Theorem 1 holds under mild distributional assumptions on the coordinates of data: a finite fourth moment and zero mean suffice.

Second, part (b) of Theorem 1 implies that our lower bound on the energy value is tight up to a multiplicative constant determined by the moments of the data. That is, there exists a W with $rank (W) \leq d - 1$ such that $L (W) = Θ (σ_{min} {(W^{*})}^{4})$ , where the asymptotic $Θ (\cdot)$ hides the constants μ₂ and μ₄.
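Both bounds of Theorem 1 can be seen in a small example. The Python sketch below assumes NumPy and standard normal coordinates, so that $μ_{2} = 1$ , $μ_{4} = 3$ , and $E [{(X^{T} M X)}^{2}] = trace (M)^{2} + 2 {‖ M ‖}_{F}^{2}$ for symmetric M. The rank-deficient W used here, obtained by zeroing the smallest singular value of $W^{*}$ , is an illustrative choice of ours and is not claimed to be the matrix constructed in the proof of part (b).

```python
import numpy as np

rng = np.random.default_rng(7)
m, d = 6, 4
W_star = rng.standard_normal((m, d))    # hypothetical planted weights, full rank a.s.

# Zero out the smallest singular value of W* to obtain a rank-(d-1) matrix W.
U, s, Vt = np.linalg.svd(W_star, full_matrices=False)
s_deficient = s.copy()
s_deficient[-1] = 0.0
W = (U * s_deficient) @ Vt
sigma_min = s[-1]

mu2, mu4 = 1.0, 3.0                     # moments of a standard normal coordinate
M = W_star.T @ W_star - W.T @ W         # equals sigma_min^2 * v v^T for a unit vector v
# Population risk in closed form, valid for i.i.d. standard normal coordinates.
risk = float(np.trace(M) ** 2 + 2 * np.linalg.norm(M, "fro") ** 2)

lower = min(mu4 - mu2 ** 2, 2 * mu2 ** 2) * sigma_min ** 4   # bound in part (a)
upper = max(mu4, 3 * mu2 ** 2) * sigma_min ** 4              # bound in part (b)
```

For this W the risk equals $3 σ_{min} {(W^{*})}^{4}$ , which sits between the lower bound $2 σ_{min} {(W^{*})}^{4}$ and the upper bound $3 σ_{min} {(W^{*})}^{4}$ in the Gaussian case.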

Our next result is an analogue of Theorem 1 for the empirical risk $\hat{L} (\cdot)$ , and it establishes the presence of a similar energy barrier in the landscape of the empirical risk $\hat{L} (\cdot)$ below which any rank-deficient $W \in R^{m \times d}$ ceases to exist with high probability.

Theorem 2.

Let K > 0 be an arbitrary constant and $X_{i} \in R^{d}, 1 \leq i \leq N$ be a collection of i.i.d. random vectors each having centered i.i.d. sub-Gaussian coordinates. That is, for some C > 0, $P (| X (j) | > t) \leq \exp (- C t^{2})$ for every $t \geq 0, j \in [d]$ . Suppose, furthermore, that, for every M > 0, the distribution of X(j), conditional on $| X (j) | \leq M$ , is centered: $E [X (j) | | X (j) | \leq M] = 0$ . Let $Y_{i} = f (W^{*}; X_{i}), 1 \leq i \leq N$ be the corresponding label generated by a planted teacher network per (1), where $rank (W^{*}) = d$ and ${‖ W^{*} ‖}_{F} \leq d^{K}$ . Then, for some absolute constants $C_{3}, C' > 0$ with probability at least

1 - \exp (- C' d) - {(9 d^{4 K + 9})}^{d^{2} - 1} \cdot \exp (- C_{3} N d^{- 4 - 4 K}) - N d e^{- C d},

it holds that

\min_{W \in R^{m \times d} : rank (W) \leq d - 1} \hat{L} (W) ≜ \min_{W \in R^{m \times d} : rank (W) \leq d - 1} \frac{1}{N} \sum_{1 \leq i \leq N} (Y_{i} - f (W; X_{i}))^{2} \geq \frac{1}{2} C_{5} σ_{min} {(W^{*})}^{4} .

Here,

C_{5} = \min \{μ_{4} (1 / 2) - μ_{2} {(1 / 2)}^{2}, 2 μ_{2} {(1 / 2)}^{2}\}, where μ_{t} (1 / 2) = E [X {(j)}^{t} \mid | X (j) | \leq d^{1 / 2}] .

Furthermore, if the dimension d of data is constant ( $d = O (1)$ ), then with probability $1 - O (1 / N)$ ,

\min_{W \in R^{m \times d} : rank (W) \leq d - 1} \hat{L} (W) \geq \frac{1}{2} \bar{C_{5}} σ_{min} {(W^{*})}^{4},

where

\bar{C_{5}} = \min \{μ_{4} - μ_{2}^{2}, 2 μ_{2}^{2}\} and μ_{n} = E [X {(j)}^{n}] .

The proof of Theorem 2 is provided in Section 4.9. Several remarks are now in order.

Assuming d is large, Theorem 2 shows that, with high probability, $\hat{L} (W)$ is bounded away from zero by an explicit constant for any rank-deficient W, provided $N = d^{O (1)}$ , where the O(1) term depends on K. Furthermore, provided N is a sufficiently large polynomial in d, the probability estimate is of the form $1 - \exp (- Θ (d))$ ; that is, the failure probability is exponentially small in the dimension.

Note that one indeed needs a finite-d correction for the case when the data are low-dimensional, $d = O (1)$ : for $d = O (1)$ , the term $N d \exp (- Θ (d))$ makes the probability estimate vacuous. The constant $\bar{C_{5}}$ appearing in this case is precisely the same constant appearing in Theorem 1, and in particular, no conditioning is required. Furthermore, even though we establish the probability estimate to be $1 - O (1 / N)$ for simplicity, it can be improved: it appears, from our analysis (which uses Chebyshev’s inequality), that for any $α > 0$ , one can show the probability estimate to be $1 - O (N^{- α})$ . Furthermore, using more elaborate tools (such as concentration for heavy-tailed variables, in particular, for i.i.d. averages of fourth moments of sub-Gaussian variables), this estimate can potentially be improved to $1 - \exp (- {\bar{c}}_{0} N^{c_{0}})$ for suitable constants $c_{0}, {\bar{c}}_{0} > 0$ . Finally, our analysis also yields that the $1 - O (1 / N)$ probability estimate remains valid even when X_i has centered i.i.d. coordinates with a finite eighth moment. That is, the sub-Gaussianity assumption can be relaxed. Indeed, the expression $\hat{L} (W)$ is an i.i.d. sum of the form $N^{- 1} \sum_{1 \leq i \leq N} {(X_{i}^{T} M X_{i})}^{2}$ for a suitable matrix M, and one needs the finiteness of $E [{(X^{T} M X)}^{4}]$ to apply Chebyshev’s inequality: this quantity is finite provided $E [X_{i} {(j)}^{8}] < \infty$ and ${‖ M ‖}_{F} = O (1)$ . We do not pursue these extensions in order to keep the presentation clear.

We now discuss the assumption that the coordinates of X_i are conditionally centered. This assumption is not inherent to the problem; it is adopted purely for technical reasons. Toward Theorem 2, one needs to establish the existence of an analogous energy barrier for a single rank-deficient $A \in R^{d \times d}$ and should in particular study an event of the form

E_{A} = \{\frac{1}{N} \sum_{1 \leq i \leq N} {(Y_{i} - X_{i}^{T} A X_{i})}^{2} \geq \frac{1}{2} C_{5} (K_{1}) σ_{min} {(W^{*})}^{4}\},

where $C_{5} (K_{1})$ is a certain constant; see Lemma 2. To control $P [E_{A}]$ , we first condition on an event of the form $\{\max_{i} {‖ X_{i} ‖}_{\infty} \leq d^{K_{1}}\}$ on which the coordinates of X_i are bounded, and we then apply Hoeffding’s inequality to obtain probabilistic guarantees that are exponential in the sample size N. Here, the fact that the coordinates of X_i are conditionally centered makes Theorem 1 applicable; one can then deduce that the conditional mean of ${(X^{T} A X - X^{T} {(W^{*})}^{T} W^{*} X)}^{2}$ is at least $C_{5} (K_{1}) σ_{min} {(W^{*})}^{4}$ . Without this assumption, it appears difficult to study the aforementioned probability, and one needs more sophisticated tools. One potential approach is to apply Jensen’s inequality to control the difference between the output of the model and that of the planted model and then to apply Mendelson’s [61] small ball method; such an argument might, in fact, help relax the distributional assumptions on X_i. (We thank the anonymous reviewer for suggesting this argument.) The assumption that the conditional mean of $X_{i} (j)$ is zero is benign; it holds, for example, for random variables whose distributions are symmetric around zero. For instance, any zero-mean normal distribution satisfies this assumption. Furthermore, Theorem 2 remains valid even when the coordinates of the data have heavier tails: our techniques also apply to tails of the form $P (| X_{i} (j) | > t) \leq \exp (- C t^{α})$ , where $C, α > 0$ are arbitrary. We use α = 2 throughout for simplicity.

We next comment on the dependence on ${‖ W^{*} ‖}_{F}$ through the constant K. The value of the energy barrier is independent of K, whereas the probability guarantee does depend on it. This is an artifact of our proof technique. Our proof is based on an ϵ-net argument for a set of rank-deficient matrices with bounded ${‖ A ‖}_{F}$ together with a union bound. The cardinality bound we employ on this net per Lemma 3 depends on d^K; consequently, this constant appears in the probability guarantee. Once again, it might be possible to avoid this by using a different argument as outlined earlier.

An inspection of the proofs of Theorems 1 and 2 shows that the landscapes of the corresponding risks still admit an energy barrier even if we consider the same network architecture with planted weight matrix $W^{*} \in R^{m \times d}$ and a quadratic activation function having lower order terms, that is, the activation $α x^{2} + β x + γ$ with $α \neq 0$ . In this case, in addition to $σ_{min} (W^{*})$ and the corresponding moments of the data, the coefficient α also appears in the barrier expression. In particular, Theorem 1 remains valid with $\min \{μ_{4} - μ_{2}^{2}, 2 μ_{2}^{2}\} \cdot σ_{min} {(W^{*})}^{4}$ replaced with $α^{2} \cdot \min \{μ_{4} - μ_{2}^{2}, 2 μ_{2}^{2}\} \cdot σ_{min} {(W^{*})}^{4}$ , and Theorem 2 remains valid with $\frac{1}{2} C_{5} σ_{min} {(W^{*})}^{4}$ replaced with $\frac{α^{2}}{2} C_{5} σ_{min} {(W^{*})}^{4}$ .

2.1.2. Global Optimality of Full-Rank Stationary Points.

We now establish that, if W is a full-rank stationary point of the population risk, $L (\cdot)$ , then W is necessarily a global minimum.

Theorem 3.

Suppose $W^{*} \in R^{m \times d}$ with $rank (W^{*}) = d$ . Suppose $X \in R^{d}$ has centered i.i.d. coordinates with $E [X_{i}^{2}] = μ_{2}, E [X_{i}^{4}] = μ_{4}$ , and $Var (X_{i}^{2}) > 0$ . Let $W \in R^{m \times d}$ be a full-rank stationary point of the population risk, that is, $\nabla L (W) = E [\nabla {(f (W^{*}; X) - f (W; X))}^{2}] = 0$ and $rank (W) = d$ . Then, $W = Q W^{*}$ for some orthogonal matrix Q, and $L (W) = 0$ .

The proof of Theorem 3 is deferred to Section 4.6.

Our next result is an analogue of Theorem 3 for the empirical risk, $\hat{L} (\cdot)$ , and shows that, if $N \geq d (d + 1) / 2$ and W is any full-rank stationary point of the empirical risk, then W is necessarily a global minimum.

Theorem 4.

Let $X_{i} \in R^{d}, 1 \leq i \leq N$ , $W^{*} \in R^{m \times d}$ with $rank (W^{*}) = d$ , and suppose W is a full-rank stationary point of the empirical risk: $rank (W) = d$ , and $\nabla_{W} \hat{L} (W) = 0$ . Then, $\hat{L} (W) = 0$ . Furthermore, if X_i are i.i.d. random vectors having a jointly continuous distribution and $N \geq d (d + 1) / 2$ , then with probability one, $W = Q W^{*}$ for some orthogonal matrix $Q \in R^{m \times m}$ .

The proof of Theorem 4 is given in Section 4.10.

Note that an implication of Theorems 3 and 4 is that the corresponding losses admit no full-rank saddle points. Namely, the landscape of the corresponding losses has fairly benign properties below the aforementioned energy barrier. In the next section, we show how this implies the convergence of gradient descent.
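The role of the full-rank condition in the empirical case can be seen from a short calculation. The following is a sketch only, written under the assumption that (1) defines $f (W; X) = \sum_{j} {〈 W_{j}, X 〉}^{2} = X^{T} W^{T} W X$ ; the symbols $r_{i}$ and A are introduced here for illustration, and the complete argument is the one given in Section 4.10.

```latex
\begin{align*}
\hat{L}(W) &= \frac{1}{N}\sum_{i=1}^{N} r_i^{2},
  \qquad r_i \triangleq Y_i - f(W; X_i) = X_i^{T} A X_i,
  \qquad A \triangleq (W^{*})^{T} W^{*} - W^{T} W, \\
\nabla_W \hat{L}(W) &= -\frac{4}{N}\, W \sum_{i=1}^{N} r_i X_i X_i^{T}. \\
\intertext{If $rank(W) = d$, then $W$ has a trivial kernel, so $\nabla_W \hat{L}(W) = 0$ forces}
\sum_{i=1}^{N} r_i X_i X_i^{T} &= 0. \\
\intertext{Taking the trace inner product of this identity with $A$ gives}
0 &= \Big\langle \sum_{i=1}^{N} r_i X_i X_i^{T},\, A \Big\rangle
   = \sum_{i=1}^{N} r_i\, X_i^{T} A X_i
   = \sum_{i=1}^{N} r_i^{2}
   = N \hat{L}(W).
\end{align*}
```

This matches the first conclusion of Theorem 4, which holds for any data set. The second conclusion, $W = Q W^{*}$ , additionally needs the data to determine A, which is where the condition $N \geq d (d + 1) / 2$ enters.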

2.1.3. Convergence of Gradient Descent.

We now combine Theorems 1 and 3 to obtain the following conclusion on the performance of the gradient descent algorithm for the population risk. Suppose that the algorithm is initialized at a point W₀ having a small population risk $L (W_{0})$ , in particular, lower than the smallest risk value achieved by the rank-deficient matrices. Then, with a properly chosen step size, the algorithm converges to a global optimum: it generates a trajectory $\{W_{k}\}_{k \geq 0}$ of weights such that $\lim_{k \to \infty} L (W_{k}) = \min_{W} L (W) = 0$ .

Theorem 5.

Let $W_{0} \in R^{m \times d}$ be a matrix of weights with the property that

L (W_{0}) < \min_{W \in R^{m \times d} : rank (W) < d} L (W) .

Define

L = \sup \{‖ \nabla^{2} L (W) ‖ : L (W) \leq L (W_{0})\},

where by $‖ \nabla^{2} L (W) ‖$ , we denote the spectral norm of the matrix $\nabla^{2} L (W)$ . Then, $L < \infty$ and the gradient descent algorithm with initialization $W_{0} \in R^{m \times d}$ and any fixed step size of $0 < η < 1 / 2 L$ generates a trajectory $\{W_{k}\}_{k \geq 0}$ of weights such that $\lim_{k \to \infty} L (W_{k}) = \min_{W} L (W) = 0$ .

The proof of Theorem 5 is provided in Section 4.7.

Our next focus is on the performance of the gradient descent algorithm for the empirical risk. By combining Theorems 2 and 4, we obtain the following conclusions. Suppose that the gradient descent algorithm is initialized at a point with a sufficiently small empirical risk, in particular, lower than the smallest risk value achieved by rank-deficient matrices (i.e., the energy barrier), and fix an $ϵ > 0$ . Then, with a properly chosen step size, it finds an approximate stationary point W (that is, a $W \in R^{m \times d}$ with a small ${‖ \nabla \hat{L} (W) ‖}_{F}$ ) in time $poly (ϵ^{- 1}, σ_{min} {(W^{*})}^{- 1}, d)$ for which the weights W^TW are uniformly ϵ-close to the planted weights ${(W^{*})}^{T} W^{*}$ , and consequently, the generalization error $L (W)$ is at most ϵ. Furthermore, the algorithm converges to a global optimum of the empirical risk minimization problem $\min_{W} \hat{L} (W)$ , which is zero, thus recovering planted weights because of the absence of spurious stationary points within the set of full-rank matrices.

Theorem 6.

Let $ϵ, K > 0$ be arbitrary. Suppose that $X_{i} \in R^{d}, 1 \leq i \leq N$ satisfies the assumptions in Theorem 2; $W_{0} \in R^{m \times d}$ is a matrix of weights with the property

\hat{L} (W_{0}) < \frac{1}{2} C_{5} σ_{min} {(W^{*})}^{4},

where C₅ is the constant defined in Theorem 2 and ${‖ W^{*} ‖}_{F} \leq d^{K}$ . Define

L = L (W_{0}) ≜ \sup \{‖ \nabla^{2} \hat{L} (W) ‖ : \hat{L} (W) \leq \hat{L} (W_{0})\},

where by $‖ \nabla^{2} \hat{L} (W) ‖$ , we denote the spectral norm of the (Hessian) matrix $\nabla^{2} \hat{L} (W)$ and

ζ ≜ \min \{\frac{ϵ \cdot σ_{min} {(W^{*})}^{2}}{32 d^{4 K + 4}}, \frac{ϵ^{2} \cdot σ_{min} {(W^{*})}^{2}}{{(C')}^{2} d^{4 K + 15}}, \frac{ϵ \cdot σ_{min} (W^{*})}{2 {(C')}^{2} μ_{2}^{2} d^{4 K + 16}}\}, (4)

where $C' > 0$ is some absolute constant, and $μ_{2} = E [X {(j)}^{2}]$ . Then, with probability at least

1 - \exp (- c' N^{1 / 4}) - {(9 d^{4 K + 9})}^{d^{2} - 1} \cdot \exp (- C_{4} N d^{- 4 - 4 K}) - N d \exp (- C d),

(where $c', C, C_{4} > 0$ are also absolute constants), the following hold.

For any W with $\hat{L} (W) < \frac{1}{2} C_{5} σ_{min} {(W^{*})}^{4}$ , ${‖ W ‖}_{F} \leq d^{K + 1}$ . Moreover, $L = poly (d) < \infty$ .
Running the gradient descent algorithm starting from W₀ with a step size of $0 < η < 1 / 2 L$ generates a full-rank $W \in R^{m \times d}$ with ${‖ \nabla \hat{L} (W) ‖}_{F} \leq ζ$ in time $poly (ϵ^{- 1}, σ_{min} {(W^{*})}^{- 1}, d)$ . Furthermore, for this W, $\hat{L} (W) \leq ϵ$ .
For W in (b), it holds that ${‖ W^{T} W - {(W^{*})}^{T} W^{*} ‖}_{F} \leq ϵ$ , and the generalization error $L (W)$ is at most ϵ provided $N \geq d^{58 / 3}$ .

Furthermore, suppose the data dimension d is constant, $d = O (1)$ , and $W_{0} \in R^{m \times d}$ is a matrix of weights with the property $\hat{L} (W_{0}) < \frac{1}{2} \bar{C_{5}} σ_{min} {(W^{*})}^{4}$ for the same constant $\bar{C_{5}}$ appearing in Theorem 2. Then, for ζ chosen per (4), with probability at least $1 - O (1 / N)$ ,

For any W with $\hat{L} (W) \leq \frac{1}{2} \bar{C_{5}} σ_{min} {(W^{*})}^{4}$ , it is the case that ${‖ W ‖}_{F} = O (1)$ . Moreover, $L = O (1) < \infty$ .
Running the gradient descent algorithm starting from W₀ with a step size of $0 < η < 1 / 2 L$ generates a full-rank W with $‖ \nabla \hat{L} (W) ‖ \leq ζ$ and $\hat{L} (W) \leq ϵ$ .
For any W in (b), $\max {{‖ W^{T} W - {(W^{*})}^{T} W^{*} ‖}_{F}, L (W)} \leq ϵ$ .

The proof of Theorem 6 is provided in Section 4.11. Several remarks are now in order.

As a simple corollary to (b), we, thus, obtain that, for d large enough, the gradient descent algorithm with initialization $W_{0} \in R^{m \times d}$ and a step size of $0 < η < 1 / 2 L$ generates a trajectory ${W_{k}}_{k \geq 0}$ of weights such that $\lim_{k \to \infty} \hat{L} (W_{k}) = \min_{W} \hat{L} (W) = 0$ . This yields an analogue of Theorem 5 for the empirical risk, $\hat{L} (\cdot)$ .
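The behavior described by this corollary can be observed on a small synthetic instance. The following is a minimal sketch under several assumptions that are not part of Theorem 6: the network is taken to be $f (W; x) = {‖ W x ‖}^{2}$ , the data are standard normal, the initialization is a small perturbation of the planted weights (a hypothetical warm start standing in for an initialization below the energy barrier), and the step size and iteration count are conservative illustrative values rather than the Hessian-based choice of the theorem.

```python
import numpy as np

def predict(W, X):
    """f(W; x) = ||W x||^2 for each row x of X (assumed form of the network)."""
    return np.sum((X @ W.T) ** 2, axis=1)

def empirical_risk(W, X, y):
    """Mean squared error between labels and network outputs."""
    return np.mean((y - predict(W, X)) ** 2)

def gradient(W, X, y):
    """Gradient of the empirical risk: -(4/N) * W * sum_i r_i x_i x_i^T."""
    r = y - predict(W, X)                 # residuals r_i
    return -4.0 / len(y) * W @ ((X.T * r) @ X)

def gradient_descent(W0, X, y, step, iters):
    """Plain gradient descent with a fixed step size."""
    W = W0.copy()
    for _ in range(iters):
        W -= step * gradient(W, X, y)
    return W

rng = np.random.default_rng(1)
d, m, N = 3, 5, 50
X = rng.standard_normal((N, d))
W_star = rng.standard_normal((m, d)) / np.sqrt(m)
y = predict(W_star, X)
W0 = W_star + 0.05 * rng.standard_normal((m, d))  # hypothetical warm start
W = gradient_descent(W0, X, y, step=2e-4, iters=20000)
print(empirical_risk(W0, X, y), empirical_risk(W, X, y))
```

The printed pair is the empirical risk before and after training. A step size derived from the quantity L of Theorem 6 would replace the hand-picked `step` in a faithful implementation.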

Note that, when N is a sufficiently large polynomial-in-d function, the probability estimate for the first part is essentially $1 - \exp (- Θ (d))$ . Furthermore, we note that the exponent 1/4 in the probability estimate and the sample bound $d^{58 / 3}$ are required only for part (c) and can potentially be improved. In particular, the exponent can be improved to one for parts (a), (b), and (d).

We now expand on the sample complexity $d^{58 / 3}$ per part $(c)$ , which appears rather poor. In light of Soltanolkotabi et al. [83], a much smaller sample complexity of $O (d^{2})$ suffices to obtain an approximate stationary point W. Our work, on the other hand, takes a step further: we control the deviation of W^TW from ${(W^{*})}^{T} W^{*}$ and, subsequently, bound the generalization error. As mentioned earlier, we are unaware of similar guarantees in the literature; the prior work, in fact, seems to focus only on minimizing the training error. This extension is based on a certain technical result established in Emschwiller et al. [33] (see the following for details), and the sample bound is likely to be an artifact of this approach. That said, it might be possible to improve the dependence on the dimension d. As a potential avenue, one may consider leveraging techniques from the phase retrieval literature, such as PhaseLift (Candès and Li [13], Candès et al. [15], Demanet and Hand [22], Eldar and Mendelson [32]), or from the matrix sensing literature mentioned earlier. It appears that some of these algorithms solve a certain convex relaxation, and the fact that the underlying matrix to be recovered is low rank is crucial for this convex program. In our case, $W^{*}$ and, therefore, ${(W^{*})}^{T} W^{*}$ are full rank, so it is not clear whether these algorithms would apply as is. In fact, as mentioned earlier, closely related works in the matrix sensing literature address (a variant of) an empirical risk minimization problem, though they do not quantify how close the obtained solution is to the ground truth. The connection between our model and the phase retrieval/matrix sensing literature is very intriguing; it is plausible that the algorithms in the latter domain might apply to our case. This direction merits further investigation and is left for future work.

Note again that, analogous to Theorem 2, one needs a correction for the finite d case because the term $N d \exp (- Θ (d))$ makes the probability estimate vacuous for $d = O (1)$ . Furthermore, the remarks following Theorem 2 still remain valid. In particular, the choice of $1 - O (1 / N)$ is for simplicity, and the estimate can be improved almost immediately to $1 - O (N^{- α})$ for any $α > 0$ and to $1 - \exp (- {\bar{c}}_{0} N^{c_{0}})$ for some $c_{0}, {\bar{c}}_{0} > 0$ using more delicate concentration tools.

We now provide an important remark pertaining to (c): as we show in the proof, provided N grows at least polynomially in d, with probability $1 - \exp (- C' N^{1 / 4})$ over X_i, it holds that, for any W having a small risk, $\hat{L} (W)$ , W^TW is close to ${(W^{*})}^{T} W^{*}$ : ${‖ W^{T} W - {(W^{*})}^{T} W^{*} ‖}_{F}$ is small. Consequently, $L (W)$ is small. This is one of the additional technical results of our paper and is achieved by controlling the condition number of a certain matrix whose i.i.d. rows consist of the tensorized data $X_{i}^{\otimes 2}$ . The proof uses a recent work analyzing the spectrum of expected covariance matrices of tensorized data (Emschwiller et al. [33, theorem 5.1]). It is worth highlighting that the independence of the coordinates of X_i is crucial for establishing Theorem 6. Our results leverage techniques in Emschwiller et al. [33] and require, in particular, that the coordinates of X_i are samples from an admissible distribution in the sense of Emschwiller et al. [33, definition 2.4, theorem 5.1]; they would no longer be valid if X_i had potentially dependent coordinates.

These results concern the performance of gradient descent assuming the initialization is proper, that is, it is below the aforementioned energy barrier. One can then naturally ask whether such an initialization is indeed possible in some generic context. In the next section, we address this question of proper initialization when the (planted) weights are generated randomly in order to complement Theorems 5 and 6. We establish that such a proper initialization is indeed possible by providing a deterministic initialization guarantee, which, with high probability, beats the aforementioned energy barrier.

Remark 1.

Before we close this section, we remark on the full-rank assumption, $rank (W^{*}) = d$ . This assumption can be relaxed for Theorems 1 and 2, and one can establish a similar energy barrier for all W with $rank (W) < rank (W^{*})$ . For instance, suppose $rank (W^{*}) = d - 1$ and $rank (W) \leq d - 2$ . Using an eigenvalue interlacing argument (see, e.g., Fulton [35, equation 11]), one can subsequently modify the proofs of Theorems 1 and 2 and obtain a similar energy barrier for all W whose rank is strictly below that of planted $W^{*}$ . On the other hand, the full-rank assumption appears necessary for the results regarding stationary points and global optimality (Theorems 3 and 4) and the subsequent convergence guarantees regarding the gradient descent (Theorems 5 and 6). In fact, a similar full-rank assumption to rule out spurious local minima is also employed in the very related work Soltanolkotabi et al. [83, theorem 2.2] discussed earlier. One of their results (Soltanolkotabi et al. [83, theorem 2.1]) seems to avoid this though at the expense of a rather unnatural assumption that the output layer contains at least d positive and d negative entries. Inspecting their proof as well as ours, a full-rank assumption or the assumption that the output layer has d positive and d negative entries seem crucial: to study the gradient of a stationary point and subsequently declare global optimality, it appears necessary to show that the kernel of a certain matrix arising in the gradient calculation is trivial. It would be interesting to see to which extent this assumption can be relaxed or whether it can be avoided altogether; we leave this as an interesting open direction for future work.

2.2. On Initialization: Randomly Generated Planted Weights

Our results in the previous section show that, provided the initialization of the gradient descent method occurs below the critical energy, the algorithm converges to the global minimum. This raises the question whether such an initialization can be found in a constructive way.

In this section, we show that the answer is yes in the setting of randomly generated weights of the ground truth matrix $W^{*}$ . Specifically, we provide a way to properly initialize such networks under the assumption that the (planted) weight matrix $W^{*} \in R^{m \times d}$ has arbitrary i.i.d. centered entries with unit variance and a finite fourth moment and the data has centered i.i.d. sub-Gaussian coordinates. (It is worth mentioning that, similar to before, the sub-Gaussianity assumption on the data is required only for the case of empirical risk, and the corresponding population risk result holds under a milder distributional assumption; see Theorem 7.) Our result is valid provided that the network is sufficiently overparameterized: $m > C d^{2}$ for some large constant C. Note that this implies $W^{*}$ is a tall matrix sending $R^{d}$ into $R^{m}$ . The rationale behind this approach is as follows. The value of the risk is determined by the spectrum of $Δ ≜ W^{T} W - {(W^{*})}^{T} W^{*}$ and the moments of the data distribution. Under our randomness assumption, the so-called Wishart matrix ${(W^{*})}^{T} W^{*}$ is tightly concentrated around a multiple of the identity if m is sufficiently large. Hence, one can control the spectrum of Δ and, therefore, the loss functions ( $L (\cdot)$ and $\hat{L} (\cdot)$ ) by properly choosing the initialization W.

We now state the main results of this section, starting with the population risk version.

Theorem 7.

Suppose that the data $X \in R^{d}$ consists of i.i.d. centered coordinates with $Var (X_{i}^{2}) > 0$ and $E [X_{i}^{4}] < \infty$ . Recall $L (W)$ from (3).

(Gaussian case) Suppose that the planted weight matrix $W^{*} \in R^{m \times d}$ has i.i.d. standard normal entries. Let the initial weight matrix $W_{0} \in R^{m \times d}$ be defined by ${(W_{0})}_{i, i} = \sqrt{m + 4 d}$ for $1 \leq i \leq d$ and ${(W_{0})}_{i, j} = 0$ otherwise; that is, $W_{0}^{T} W_{0} = γ I_{d}$ with $γ = m + 4 d$ . Then, provided $m > C d^{2}$ for a sufficiently large absolute constant C > 0,
$L (W_{0}) < \min_{W \in R^{m \times d} : rank (W) < d} L (W),$
with probability at least $1 - \exp (- Ω (d))$ , where the probability is with respect to the draw of $W^{*}$ .
(General case) Suppose the planted weight matrix $W^{*} \in R^{m \times d}$ has centered i.i.d. entries with unit variance and a finite fourth moment. Let the initial weight matrix $W_{0} \in R^{m \times d}$ be defined by ${(W_{0})}_{i, i} = \sqrt{m}$ for $1 \leq i \leq d$ , and ${(W_{0})}_{i, j} = 0$ otherwise; that is, $W_{0}^{T} W_{0} = m I_{d}$ . Then, provided $m > C d^{2}$ for a sufficiently large absolute constant C > 0,
$L (W_{0}) < \min_{W \in R^{m \times d} : rank (W) < d} L (W),$
with high probability as $d \to \infty$ , where the probability is with respect to the draw of $W^{*}$ .

The proof of this theorem is provided in Section 4.8.

Note that part (a) of Theorem 7 gives an explicit rate for the probability in the case when the i.i.d. entries of the planted weight matrix $W^{*}$ are standard normal and is based on a nonasymptotic concentration result for the spectrum of such matrices. The extension in part (b) is based on a semicircle law obtained by Bai and Yin [3].
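The Gaussian case of Theorem 7 can be checked on a single random draw. The sketch below assumes standard normal data, so that the risk has the form $L (W) = trace {(A)}^{2} + 2 trace (A^{2})$ from Theorem 12 and the Theorem 1 lower bound on the barrier is $2 σ_{min} {(W^{*})}^{4}$ . The sizes d = 4 and m = 2000 are illustrative choices with m much larger than d², not values prescribed by the theorem, and the helper names are hypothetical.

```python
import numpy as np

def population_risk_gaussian(W, W_star):
    """L(W) = trace(A)^2 + 2 trace(A^2) for standard normal data (Theorem 12)."""
    A = W_star.T @ W_star - W.T @ W
    return np.trace(A) ** 2 + 2 * np.trace(A @ A)

def diagonal_init(m, d, gamma):
    """W0 with (W0)_{ii} = sqrt(gamma) for i <= d and zeros elsewhere."""
    W0 = np.zeros((m, d))
    W0[np.arange(d), np.arange(d)] = np.sqrt(gamma)
    return W0

rng = np.random.default_rng(2)
d, m = 4, 2000                       # illustrative sizes, m much larger than d^2
W_star = rng.standard_normal((m, d))
W0 = diagonal_init(m, d, gamma=m + 4 * d)
risk0 = population_risk_gaussian(W0, W_star)
sigma_min = np.linalg.svd(W_star, compute_uv=False)[-1]
barrier_lower = 2 * sigma_min ** 4   # Theorem 1 lower bound with mu2 = 1, mu4 = 3
print(risk0, barrier_lower)
```

Because `barrier_lower` bounds the minimum risk over rank-deficient matrices from below, observing `risk0 < barrier_lower` is sufficient for the conclusion of part (a) on this draw.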

The corresponding result for the empirical risk is provided as follows.

Theorem 8.

Suppose that the planted weight matrix $W^{*} \in R^{m \times d}$ has centered i.i.d. entries with unit variance and a finite fourth moment; the (i.i.d.) data $X_{i} \in R^{d}, 1 \leq i \leq N$ , has i.i.d. centered sub-Gaussian coordinates (namely, for some C > 0, $P (| X_{i} (j) | > t) \leq \exp (- C t^{2})$ for any t > 0, $1 \leq i \leq N$ and $1 \leq j \leq d$ ) and the $W_{0} \in R^{m \times d}$ satisfies ${(W_{0})}_{i i} = \sqrt{m}$ for $i \in [d]$ and ${(W_{0})}_{i j} = 0$ for $i \neq j$ , that is, $W_{0}^{T} W_{0} = m I_{d} \in R^{d \times d}$ . Then, for some absolute constants $C, C' > 0$ with probability at least

1 - \exp (- C' \frac{N}{d^{5} m}) - N d \exp (- C d) - o_{d} (1),

it is the case that, for the constant C₅ defined in Theorem 2,

\hat{L} (W_{0}) < \frac{1}{2} C_{5} σ_{min} {(W^{*})}^{4}

provided $m > C'' d^{2}$ for a sufficiently large constant $C'' > 0$ .

The proof of Theorem 8 is provided in Section 4.12.

It is worth noting that, unlike the earlier Theorems 2 and 6, Theorems 7 and 8 do not have a separate statement for the case of finite d ( $d = O (1)$ ): in order to ensure the concentration property for the Wishart ensemble takes place, one should consider the regime $d \to \infty$ . That is, our initialization results do not hold for the regime in which d is constant.

We highlight that, for low-rank models, such as the problem of sensing a low-rank matrix described earlier as well as learning quadratic networks with a low-rank weight matrix $W^{*}$ , prior work proposes spectral initialization (Gunasekar et al. [41], Khodak et al. [52], Li et al. [57], Stöger and Soltanolkotabi [87], Vodrahalli et al. [92]) and analyzes the performance of gradient descent in recovering the planted weights. More concretely, for recovering an unknown rank-r matrix $A^{*} \in R^{d \times d}$ , these works propose initializations of the form UU^T, where $U \in R^{d \times r}$ . For our results, in particular, the global optimality of certain stationary points (Theorems 3 and 4) as well as the convergence guarantees for the gradient descent (Theorems 5 and 6), the fact that $W^{*}$ is full rank is crucial. Even though such a low-rank initialization would not suffice for the goal of finding a W₀ with loss below the energy barrier, it would be interesting to analyze the performance of such an initialization for a higher number of iterations. If successful, such an initialization would enable us to relax the randomness assumption on $W^{*}$ ; we leave this as an interesting future direction.

We now turn our attention to the number of training samples required to learn such models.

2.3. Critical Number of Training Samples

In this section, we explore the smallest number of training data for which a small empirical risk controls the generalization error.

2.3.1. A Necessary and Sufficient Geometrical Condition.

We start by identifying a necessary and sufficient geometrical condition on the data under which any minimizer of the empirical risk has zero generalization error. For our setting, any such minimizer necessarily interpolates the data.

Theorem 9.

Let $X_{1}, \dots, X_{N} \in R^{d}$ be a set of data. Let (P) be the property that $span \{X_{i} X_{i}^{T} : 1 \leq i \leq N\} = S$ , where $S$ is the set of all d × d symmetric real matrices.

Suppose that (P) holds. Then, for any $\hat{m} \in N$ and $W \in R^{\hat{m} \times d}$ satisfying $f (W^{*}; X_{i}) = f (W; X_{i}), \forall i \in [N]$ , we have $W^{T} W = {(W^{*})}^{T} W^{*}$ . In particular, if $\hat{m} \geq m$ , then $W = Q W^{*}$ for some $Q \in R^{\hat{m} \times m}$ with $Q^{T} Q = I_{m}$ .
Suppose that (P) does not hold. Then, for any $W^{*} \in R^{m \times d}$ with $rank (W^{*}) = d$ and any $\hat{m} \geq d$ , there exists a $W \in R^{\hat{m} \times d}$ such that $f (W^{*}; X_{i}) = f (W; X_{i}), \forall i \in [N]$ , and yet $W^{T} W \neq {(W^{*})}^{T} W^{*}$ . In particular, almost surely $L (W) > 0$ with respect to any jointly continuous distribution on $R^{d}$ .

The proof of Theorem 9 is provided in Section 4.13.

Note that the property (P), $span \{X_{i} X_{i}^{T} : i \in [N]\} = S$ , is a purely geometrical necessary and sufficient condition; it can be checked ahead of the optimization rather than retrospectively. Under (P), any minimizer of the empirical risk $\hat{L}$ has perfect generalization. Conversely, when (P) does not hold, there are global optima of $\hat{L}$ that (almost surely) have a strictly positive generalization error. Furthermore, Theorem 9 applies to an arbitrary data set X_i (not necessarily random). We highlight the extension that, when $\hat{L} (W)$ is positive though small, one can control ${‖ W^{T} W - {(W^{*})}^{T} W^{*} ‖}_{F}$ via Theorem 6(c) and bound $L (W)$ . For a more refined version of Theorem 9, see Theorem 11.
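The failure mode in the second part of Theorem 9 can be made concrete. The sketch below is one possible construction under the assumption $f (W; x) = x^{T} W^{T} W x$ , and it is not claimed to be the construction used in Section 4.13: when $N < d (d + 1) / 2$ , it picks a nonzero symmetric B with $X_{i}^{T} B X_{i} = 0$ for all i, perturbs ${(W^{*})}^{T} W^{*}$ by a small multiple of B so that the result stays positive definite, and factors the perturbed matrix. The helper names `sym_features` and `alternative_interpolant` are hypothetical.

```python
import numpy as np

def sym_features(X):
    """Rows phi(x) with <phi(x), b> = x^T B x, where b holds the upper triangle of symmetric B."""
    d = X.shape[1]
    iu, ju = np.triu_indices(d)
    weights = np.where(iu == ju, 1.0, 2.0)   # off-diagonal entries appear twice in x^T B x
    return X[:, iu] * X[:, ju] * weights, (iu, ju)

def alternative_interpolant(W_star, X):
    """Return W fitting the labels of W* on X with W^T W != (W*)^T W* (needs N < d(d+1)/2)."""
    d = W_star.shape[1]
    Phi, (iu, ju) = sym_features(X)
    b = np.linalg.svd(Phi, full_matrices=True)[2][-1]  # unit vector in the null space of Phi
    B = np.zeros((d, d))
    B[iu, ju] = b
    B = B + B.T - np.diag(np.diag(B))                  # symmetric B with x_i^T B x_i = 0
    M_star = W_star.T @ W_star
    t = 0.5 * np.linalg.eigvalsh(M_star)[0] / np.linalg.norm(B, 2)
    evals, evecs = np.linalg.eigh(M_star + t * B)      # perturbed matrix stays positive definite
    return np.sqrt(evals)[:, None] * evecs.T           # d x d factor with W^T W = M* + t B

rng = np.random.default_rng(3)
d, m, N = 3, 6, 5                    # N = 5 < d(d+1)/2 = 6, so (P) fails
X = rng.standard_normal((N, d))
W_star = rng.standard_normal((m, d))
W = alternative_interpolant(W_star, X)
fit_gap = np.max(np.abs(np.sum((X @ W.T) ** 2, axis=1)
                        - np.sum((X @ W_star.T) ** 2, axis=1)))
gram_gap = np.linalg.norm(W.T @ W - W_star.T @ W_star)
print(fit_gap, gram_gap)
```

Here `fit_gap` measures the largest disagreement on the training points and `gram_gap` measures ${‖ W^{T} W - {(W^{*})}^{T} W^{*} ‖}_{F}$ , so a vanishing first value together with a positive second value illustrates interpolation without recovery.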

We next highlight the parameter $\hat{m} \in N$ . Note that, when (P) holds, any network with $\hat{m}$ nodes interpolating the data necessarily has zero generalization error. This holds even in the overparameterized regime, $\hat{m} \geq m$ . Namely, once the data are interpolated, overparameterization does not hurt the generalization ability. This is an instance of an empirically observed (and extensively studied) phenomenon regarding neural networks that defies classical statistical wisdom.

Finally, Theorem 9 still holds if each node has an associated positive output weight $a_{j}^{*} \in R^{+}$ , that is if the network computes $\sum_{j \leq m} a_{j}^{*} {〈 W_{j}^{*}, X 〉}^{2}$ . This is easily seen by pushing $a_{j}^{*}$ inside $W_{j}^{*}$ .

2.3.2. Randomized Data Enjoys the Geometric Condition.

We now identify the smallest number $N^{*}$ of training samples such that random data $X_{i}, i \in [N]$ almost surely satisfies (P) for $N \geq N^{*}$ .

Theorem 10.

Let $N^{*} = d (d + 1) / 2$ and $X_{1}, \dots, X_{N} \in R^{d}$ be i.i.d. random vectors with a jointly continuous distribution. Then,

If $N \geq N^{*}$ , then $P (span (X_{i} X_{i}^{T} : i \in [N]) = S) = 1$ .
If $N < N^{*}$ , then for arbitrary $Z_{1}, \dots, Z_{N} \in R^{d}, span (Z_{i} Z_{i}^{T} : i \in [N]) ⊊ S$ .

Theorem 10 is likely folklore; we provide a proof in Section 4.14 for completeness. Observe that part (b) is straightforward: as $N^{*} = \binom{d}{2} + d = dim (S)$ , (P) cannot hold when $N < N^{*}$ .
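Since (P) is a statement about linear spans, it can be tested with a rank computation. The following sketch, with the hypothetical helper `satisfies_P`, stacks the upper-triangular entries of each $X_{i} X_{i}^{T}$ into a row and compares the numerical rank of the resulting matrix with $dim (S) = d (d + 1) / 2$ . It uses standard normal data as one example of a jointly continuous distribution and relies on the default tolerance of the rank routine.

```python
import numpy as np

def satisfies_P(X):
    """Check whether span{x_i x_i^T} equals the space of d x d symmetric matrices."""
    d = X.shape[1]
    iu, ju = np.triu_indices(d)
    Phi = X[:, iu] * X[:, ju]            # upper-triangular entries of each x_i x_i^T
    return np.linalg.matrix_rank(Phi) == d * (d + 1) // 2

rng = np.random.default_rng(4)
d = 4
N_star = d * (d + 1) // 2
print(satisfies_P(rng.standard_normal((N_star, d))))      # at the threshold N = N*
print(satisfies_P(rng.standard_normal((N_star - 1, d))))  # one sample short of N*
```

The second call can never succeed, because a matrix with $N^{*} - 1$ rows has rank at most $N^{*} - 1$ , which is the counting argument behind part (b).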

2.3.3. Sample Complexity Bound for the Planted Network Model.

Combining Theorems 9 and 10, we arrive at the following sample complexity result.

Theorem 11.

Let $X_{i}, i \in [N]$ be i.i.d. samples drawn from a jointly continuous distribution on $R^{d}$ and $Y_{i} = f (W^{*}; X_{i})$ , where $W^{*} \in R^{m \times d}$ with $rank (W^{*}) = d$ .

Suppose $N \geq N^{*}$ , and $\hat{m} \in N$ . Then, with probability one over $X_{1}, \dots, X_{N}$ , the following holds. If $W \in R^{\hat{m} \times d}$ with $f (W; X_{i}) = Y_{i}, \forall i \in [N]$ , then $f (W; x) = f (W^{*}; x)$ for all $x \in R^{d}$ .
Suppose $X_{i} \in R^{d}, i \in [N]$ are i.i.d., each with centered i.i.d. entries having variance μ₂ and a finite fourth moment μ₄. Suppose $N < N^{*}$ . Then, there exists a $W \in R^{m \times d}$ such that $f (W; X_{i}) = Y_{i}, \forall i \in [N]$ , but $L (W) \geq \min {μ_{4} - μ_{2}^{2}, 2 μ_{2}^{2}} \cdot σ_{min} {(W^{*})}^{4}$ .

The proof of Theorem 11 is provided in Section 4.15.

The lower bound in part (b) is very similar to the energy barrier for rank-deficient matrices per Theorems 1(a) and 2. Moreover, the interpolating network in (a) can be overparameterized. Theorem 11 provides the necessary and sufficient number of data points for training a shallow quadratic network so as to guarantee perfect generalization property.

2.3.4. Related Work.

We now discuss a related line of work on polynomial activations that seeks necessary and sufficient conditions and/or the smallest sample size under which the loss functions satisfy various favorable properties. Venturi et al. [90] study a key topological property of the loss function, namely, the presence/absence of spurious valleys. Similar to us, they identify necessary and sufficient conditions independent of the data distribution. One of their results (Venturi et al. [90, corollary 9]) shows that the empirical risk $\hat{L}$ admits no spurious valleys provided the network is sufficiently overparameterized: $m \geq N$ , where N is the sample size. (A close variant of this result was in fact shown earlier by Livni et al. [59]; see Venturi et al. [90] for a comparison.) Moreover, they also establish a similar guarantee regarding the population risk for quadratic networks; see Venturi et al. [90, theorem 12]. A similar analysis for quadratic networks is also conducted by Soltanolkotabi et al. [83] and Du et al. [24] mentioned earlier. In particular, Soltanolkotabi et al. [83] establish the absence of spurious local minima for $\hat{L}$ when $d \leq N \leq O (d^{2})$ ; Du et al. [24] establish that a regularized version of the empirical risk has no spurious minima when (a) m > d or (b) $m (m + 1) \geq 2 n$ and m < d. Yet another related work by Mannelli et al. [60] shows the following when the data has i.i.d. standard normal entries. For $n > 2 d$ , the only minimizer of the empirical risk is a planted weight matrix, whereas for $n < 2 d$ , the empirical risk admits spurious local minima, both w.h.p. as $n, d \to \infty$ .

3. Preliminaries

We collect herein several useful auxiliary results that we employ in our proofs.

3.1. An Analytical Expression for the Population Risk

Toward proving our energy barrier results, Theorems 1 and 2, we start by providing an analytical expression for the population risk $L (W)$ of any $W \in R^{m \times d}$ in terms of how close it is to the planted weight matrix $W^{*} \in R^{m \times d}$ .

We recall that a random vector X in $R^{d}$ is defined to have a jointly continuous distribution if there exists a measurable function $f : R^{d} \to R$ such that, for any Borel set $B \subseteq R^{d}$ ,

P (X \in B) = \int_{B} f (x_{1}, \dots, x_{d}) d λ (x_{1}, \dots, x_{d}),

where λ is the Lebesgue measure on $R^{d}$ .

Theorem 12.

Let $W^{*} \in R^{m \times d}$ , let $f (W^{*}; X)$ be the function computed by (1), and let $f (W; X)$ be similarly the function computed by (1) for $W \in R^{m \times d}$ . Recall,

L (W) = E [{(f (W^{*}; X) - f (W; X))}^{2}],

where the expectation is with respect to the distribution of $X \in R^{d}$ .

(a) Suppose the distribution of X is jointly continuous. Then, $L (W) = 0$ , that is, $f (W^{*}; X) = f (W; X)$ almost surely with respect to X, if and only if $W = Q W^{*}$ for some orthonormal matrix $Q \in R^{m \times m}$ .

(b) Suppose now that the coordinates of $X \in R^{d}$ are i.i.d. with $E [X_{i}] = 0, E [X_{i}^{2}] = μ_{2}$ , and $E [X_{i}^{4}] = μ_{4}$ . It holds that
$L (W) = μ_{2}^{2} \cdot trace {(A)}^{2} + 2 μ_{2}^{2} \cdot trace (A^{2}) + (μ_{4} - 3 μ_{2}^{2}) \cdot trace (A \circ A),$
where $A = {(W^{*})}^{T} W^{*} - W^{T} W \in R^{d \times d}$ and $A \circ A$ is the Hadamard product of A with itself. In particular, if $X \in R^{d}$ has i.i.d. standard normal coordinates, we obtain $L (W) = trace {(A)}^{2} + 2 trace (A^{2})$ .

(c) In the setting of part (b), the following bounds hold:
$μ_{2}^{2} \cdot trace {(A)}^{2} + \min \{μ_{4} - μ_{2}^{2}, 2 μ_{2}^{2}\} \cdot trace (A^{2}) \leq L (W),$
and
$μ_{2}^{2} \cdot trace {(A)}^{2} + \max \{μ_{4} - μ_{2}^{2}, 2 μ_{2}^{2}\} \cdot trace (A^{2}) \geq L (W) .$

The proof of Theorem 12 is provided in Section 4.1.

In a nutshell, Theorem 12 states that the population risk $L (W)$ of any $W \in R^{m \times d}$ is completely determined by how close it is to the planted weights $W^{*}$ as measured by the matrix $A = {(W^{*})}^{T} W^{*} - W^{T} W$ and the second and fourth moments of the data. This is not surprising: $L (W)$ is essentially a function of the first four moments of the data and the difference of the quadratic forms generated by W and $W^{*}$ , which is precisely encapsulated by the matrix A. Note also that the characterization of the optimal orbit per part (a) is not surprising either: any matrix W with the property $W = Q W^{*}$ , where $Q \in R^{m \times m}$ is an orthonormal matrix, that is, $Q^{T} Q = I_{m}$ , has the property that $f (W; X) = {‖ W X ‖}_{2}^{2} = X^{T} W^{T} W X = f (W^{*}; X)$ for any data $X \in R^{d}$ . Part (a) then says the reverse is true as well provided that the distribution of X is jointly continuous. Note also that, for X with centered i.i.d. entries, the thesis of part (a) follows also from part (c): $L (W) = 0$ implies that $trace (A^{2}) = 0$ , which, together with the fact that A is symmetric, then yields A = 0; that is, $W^{T} W = {(W^{*})}^{T} W^{*}$ .
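The closed form in part (b) can be sanity-checked numerically. The sketch below is illustrative only and is not part of the paper: it assumes Rademacher (±1) coordinates, for which $μ_{2} = μ_{4} = 1$ , so that the expectation can be computed exactly by enumerating all $2^{d}$ sign vectors. The function names `population_risk_formula` and `population_risk_rademacher` are hypothetical.

```python
import itertools
import numpy as np

def population_risk_formula(W_star, W, mu2, mu4):
    """Closed form of Theorem 12(b); trace(A o A) is the sum of squared diagonal entries."""
    A = W_star.T @ W_star - W.T @ W
    return (mu2**2 * np.trace(A)**2
            + 2 * mu2**2 * np.trace(A @ A)
            + (mu4 - 3 * mu2**2) * np.sum(np.diag(A)**2))

def population_risk_rademacher(W_star, W):
    """Exact E[(f(W*;X) - f(W;X))^2] for i.i.d. Rademacher coordinates (assumed data model)."""
    d = W_star.shape[1]
    total = 0.0
    for signs in itertools.product([-1.0, 1.0], repeat=d):
        x = np.array(signs)
        total += (np.sum((W_star @ x)**2) - np.sum((W @ x)**2))**2
    return total / 2**d

rng = np.random.default_rng(0)
m, d = 5, 4
W_star = rng.standard_normal((m, d))
W = rng.standard_normal((m, d))

exact = population_risk_rademacher(W_star, W)
formula = population_risk_formula(W_star, W, mu2=1.0, mu4=1.0)

# Any W = Q W* with Q orthonormal lies on the optimal orbit and has zero risk.
Q, _ = np.linalg.qr(rng.standard_normal((m, m)))
orbit_risk = population_risk_formula(W_star, Q @ W_star, mu2=1.0, mu4=1.0)
print(exact, formula, orbit_risk)
```

Enumeration is used instead of Monte Carlo so that the comparison is exact up to floating-point error; Rademacher coordinates are centered and i.i.d., so part (b) applies, although part (a) does not because the distribution is not jointly continuous.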

3.2. Useful Lemmas and Results from Linear Algebra and Random Matrix Theory

Our next result is a simple norm bound for the ensemble $X_{i} \in R^{d}, 1 \leq i \leq N$ with sub-Gaussian coordinates.

Lemma 1.

Let $X_{i} \in R^{d}, 1 \leq i \leq N$ be an i.i.d. collection of random vectors with centered i.i.d. sub-Gaussian coordinates, that is, for some constant C > 0, $P (| X_{i} (j) | > t) \leq \exp (- C t^{2})$ for every $i \in [N], j \in [d]$ , and $t \geq 0$ . Then,

P ({‖ X_{i} ‖}_{\infty} \leq d^{K_{1}}, 1 \leq i \leq N) \geq 1 - N d \exp (- C d^{2 K_{1}}) .

The proof of Lemma 1 is provided in Section 4.3.

Our energy barrier result Theorem 2 for the empirical risk is proven by establishing the emergence of a barrier for a single rank-deficient $A \in R^{d \times d}$ together with a covering numbers argument.

Lemma 2.

Let $K_{1} > 0$ be an arbitrary constant and $X_{i} \in R^{d}, 1 \leq i \leq N$ be a collection of i.i.d. data with centered i.i.d. sub-Gaussian coordinates for which for any M > 0, the mean of $| X_{1} (1) |$ conditional on $| X_{1} (1) | \leq M$ is zero, and let $Y_{i} = f (W^{*}; X_{i})$ be the corresponding label generated by a neural network with planted weights $W^{*} \in R^{m \times d}$ as per (1), where ${‖ W^{*} ‖}_{F} \leq d^{K_{2}}$ . Fix any $A \in R^{d \times d}$ , where ${‖ A ‖}_{F} \leq d^{2 K_{2}}, rank (A) \leq d - 1$ , and $A ⪰ 0$ . Define the event

E (A) ≜ \left\{ \frac{1}{N} \sum_{1 \leq i \leq N} {(Y_{i} - X_{i}^{T} A X_{i})}^{2} \geq \frac{1}{2} C_{5} (K_{1}) σ_{min} {(W^{*})}^{4} \right\},

where $C_{5} (K_{1}) ≜ \min \{μ_{4} (K_{1}) - μ_{2} {(K_{1})}^{2}, 2 μ_{2} {(K_{1})}^{2}\}$ for $μ_{n} (K) = E [X_{1} {(1)}^{n} \mid | X_{1} (1) | \leq d^{K}]$ . Then, there exists a constant $C_{3} > 0$ (independent of W and depending only on data distribution, K₁, and $W^{*}$ ) such that

P (E {(A)}^{c} \mid {‖ X_{i} ‖}_{\infty} \leq d^{K_{1}}, 1 \leq i \leq N) \leq \exp (- C_{3} \frac{N}{d^{4 K_{1} + 4 K_{2} + 2}}) .

In particular,

P (E (A)) \geq 1 - \exp (- C_{3} \frac{N}{d^{4 K_{1} + 4 K_{2} + 2}}) - N d e^{- C d^{2 K_{1}}},

where C > 0 is the same constant as in Lemma 1.

The parameter K₁ appearing in Lemma 2 controls the amount of truncation applied on training data, and K₂ controls the norm of the planted weight matrix. The proof of Lemma 2 is provided in Section 4.4.

The next result is a covering number bound adopted from Candès and Plan [14, lemma 3.1] with minor modifications.

Lemma 3.

Let

S_{R} ≜ \{A \in R^{d \times d} : rank (A) \leq r, A ⪰ 0, {‖ A ‖}_{F} \leq R\} .

Then, there exists an $ϵ$-net $\bar{S_{R}}$ for S_R in the Frobenius norm (that is, for every $A \in S_{R}$ , there exists an $\hat{A} \in \bar{S_{R}}$ such that ${‖ A - \hat{A} ‖}_{F} \leq ϵ$ ) such that

| \bar{S_{R}} | \leq {(\frac{9 R}{ϵ})}^{d r + r} .

The proof of Lemma 3 is provided in Section 4.5.

Some of our results use the following well-known results. These results are verbatim from the literature and provided herein without proof.

Theorem 13

(Caron and Traynor [16]). Let $ℓ$ be an arbitrary positive integer and $P : R^{ℓ} \to R$ be a polynomial. Then, either P is identically zero or $\{x \in R^{ℓ} : P (x) = 0\}$ has zero Lebesgue measure, namely, P(x) is nonzero almost everywhere.

Theorem 14

(Horn and Johnson [47, Theorem 7.3.11]). For two matrices $A \in R^{p \times n}$ and $B \in R^{q \times n}$ , where $q \leq p$ , $A^{T} A = B^{T} B$ holds if and only if A = QB for some matrix $Q \in R^{p \times q}$ with orthonormal columns.

Our results regarding the initialization guarantees use several auxiliary results from random matrix theory, which show that the spectrum of a tall random matrix is essentially concentrated.

Theorem 15

(Vershynin [91, Corollary 5.35]). Let A be an m × d matrix with independent standard normal entries. For every $t \geq 0$ , with probability at least $1 - 2 \exp (- t^{2} / 2)$ , we have

\sqrt{m} - \sqrt{d} - t \leq σ_{min} (A) \leq σ_{max} (A) \leq \sqrt{m} + \sqrt{d} + t .
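As an illustration of Theorem 15 (not part of the original argument), the following sketch draws one tall Gaussian matrix and compares its extreme singular values with the two-sided bound. The dimensions, the seed, and the choice t = 5 are arbitrary assumptions; with t = 5 the bound fails with probability at most $2 \exp (- 12.5)$ .

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, t = 400, 50, 5.0  # illustrative sizes; any tall shape m > d works

# m x d matrix with independent standard normal entries
A = rng.standard_normal((m, d))
sing = np.linalg.svd(A, compute_uv=False)

lower = np.sqrt(m) - np.sqrt(d) - t
upper = np.sqrt(m) + np.sqrt(d) + t
print(f"{lower:.2f} <= {sing.min():.2f} <= {sing.max():.2f} <= {upper:.2f}")
```

The same check can be repeated over many seeds to estimate how conservative the probability bound is for a given t.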

Theorem 16

(Bai and Yin [4], Vershynin [91, Theorem 5.31]). Let $A = A_{N, n}$ be an N × n random matrix whose entries are independent copies of a random variable with zero mean, unit variance, and a finite fourth moment. Suppose that the dimensions N and n grow to infinity and the aspect ratio n/N converges to a constant in $[0, 1]$ . Then,

σ_{min} (A) = \sqrt{N} - \sqrt{n} + o (\sqrt{n}), and σ_{max} (A) = \sqrt{N} + \sqrt{n} + o (\sqrt{n}),

almost surely.

The following concentration result, recorded herein verbatim from Vershynin [91], is useful for our approximate stationarity analysis.

Theorem 17

(Vershynin [91, Theorem 5.44]). Let A be an N × n matrix whose rows A_i are independent random vectors in $R^{n}$ with the common second moment matrix $Σ = E [A_{i} A_{i}^{T}]$ . Let m be a number such that ${‖ A_{i} ‖}_{2} \leq \sqrt{m}$ almost surely for all i. Then, for every $t \geq 0$ , the following inequality holds with probability at least $1 - n \cdot \exp (- c t^{2})$ :

‖ \frac{1}{N} A^{T} A - Σ ‖ \leq \max ({‖ Σ ‖}^{1 / 2} δ, δ^{2}) where δ = t \sqrt{m / N} .

Here, c > 0 is an absolute constant.

Finally, we make use of the matrix-operator version of Hölder’s inequality.

Theorem 18

(Bhatia [8]). For any matrix $U \in R^{k \times ℓ}$ , let ${‖ U ‖}_{σ_{p}}$ denote the $ℓ_{p}$ norm of the vector $(σ_{1} (U), \dots, σ_{\min \{k, ℓ\}} (U))$ of singular values of U. Then, for any matrices $U, V \in R^{k \times ℓ}$ and any $p, q > 0$ with $\frac{1}{p} + \frac{1}{q} = 1$ , it holds that

| 〈 U, V 〉 | = | trace (U^{T} V) | \leq {‖ U ‖}_{σ_{p}} {‖ V ‖}_{σ_{q}} .
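The inequality of Theorem 18 can be checked on random instances. The sketch below is only an illustration under assumed sizes and exponents (p = 3, q = 3/2); the helper `schatten_norm` is a hypothetical name for the $ℓ_{p}$ norm of the singular values.

```python
import numpy as np

def schatten_norm(M, p):
    """l_p norm of the vector of singular values of M."""
    s = np.linalg.svd(M, compute_uv=False)
    return np.sum(s**p) ** (1.0 / p)

rng = np.random.default_rng(1)
U = rng.standard_normal((4, 6))
V = rng.standard_normal((4, 6))

p = 3.0
q = p / (p - 1.0)  # conjugate exponent, so that 1/p + 1/q = 1

lhs = abs(np.trace(U.T @ V))
rhs = schatten_norm(U, p) * schatten_norm(V, q)
print(lhs, rhs)
```

For p = q = 2 the right-hand side reduces to the product of Frobenius norms, which recovers the Cauchy–Schwarz inequality used in the proof of Lemma 2.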

4. Proofs

In this section, we present the proofs of the main results of this paper.

The order of the proofs presented herein is slightly different from the order of the corresponding results in the main body in that none of the following proofs (with one exception that we detail) use a proof presented later than itself. That is, whenever we present the proof of a result, it is ensured that, if this proof requires another result as a building block, this building block is shown earlier. The rationale behind this is to avoid any potential confusion and to ensure that no cyclic reasoning is present.

With this arrangement, only Theorem 4 uses results presented later in this section (more precisely, it uses Theorems 9 and 10), and it can be checked directly that there is no cyclic reasoning in the proof of Theorem 4.

4.1. Proof of Theorem 12

First, we have

f (W^{*}; X) - f (W; X) = X^{T} ({(W^{*})}^{T} W^{*} - W^{T} W) X ≜ X^{T} A X,

(5)

where $A = {(W^{*})}^{T} W^{*} - W^{T} W \in R^{d \times d}$ is a symmetric matrix. Note also that

trace {(A)}^{2} = \sum_{i = 1}^{d} A_{i i}^{2} + 2 \sum_{i < j} A_{i i} A_{j j},

(6)

and

trace (A^{2}) = trace (A^{T} A) = {‖ A ‖}_{F}^{2} = \sum_{i, j} A_{i j}^{2} = \sum_{i = 1}^{d} A_{i i}^{2} + 2 \sum_{i < j} A_{i j}^{2},

(7)

where A² is equal to $A^{T} A$ as A is symmetric.

We first prove part (a). Recall Theorem 13. If $L (W) = 0$ , then we have $P (X) = X^{T} A X = 0$ almost surely. Because $P (\cdot) : R^{d} \to R$ is a polynomial and the distribution of X is jointly continuous, the zero set of P cannot have zero Lebesgue measure; it then follows from Theorem 13 that P = 0 identically. Now, because A is symmetric, it has real eigenvalues, denoted $λ_{1}, \dots, λ_{d}$ , with corresponding (real) eigenvectors $ξ_{1}, \dots, ξ_{d}$ . Taking $X = ξ_{i}$ , we have $X^{T} A X = ξ_{i}^{T} A ξ_{i} = λ_{i} 〈 ξ_{i}, ξ_{i} 〉 = 0$ . Because $ξ_{i} \neq 0$ , we get $λ_{i} = 0$ for every i. Finally, because A admits the eigendecomposition $A = Q Λ Q^{T}$ with Λ the diagonal matrix of its eigenvalues, it must necessarily be the case that A = 0. Hence, $W^{T} W = {(W^{*})}^{T} W^{*}$ , which implies $W = Q W^{*}$ for some $Q \in R^{m \times m}$ orthonormal, per Theorem 14.
We now prove part (b). Using Equation (5), we first have
$L (W) = \sum_{1 \leq i, j, i', j' \leq d} A_{i j} A_{i' j'} E [X_{i} X_{j} X_{i'} X_{j'}] .$
Note that, if $| \{i, j, i', j'\} | \in \{3, 4\}$ , then $E [X_{i} X_{j} X_{i'} X_{j'}] = 0$ because X has centered i.i.d. coordinates. Keeping this in mind and carrying out the algebra, we then get
$\begin{array}{rl} L (W) & = \sum_{i = 1}^{d} A_{i i}^{2} E [X_{i}^{4}] + 2 \sum_{i < j} A_{i i} A_{j j} E [X_{i}^{2}] E [X_{j}^{2}] + 4 \sum_{i < j} A_{i j}^{2} E [X_{i}^{2}] E [X_{j}^{2}] \\ & = μ_{4} \sum_{i = 1}^{d} A_{i i}^{2} + 2 μ_{2}^{2} \sum_{i < j} A_{i i} A_{j j} + 4 μ_{2}^{2} \sum_{i < j} A_{i j}^{2} . \end{array}$
Using now Equations (6) and (7), we get
$L (W) = (μ_{4} - 3 μ_{2}^{2}) \cdot trace (A \circ A) + μ_{2}^{2} \cdot trace {(A)}^{2} + 2 μ_{2}^{2} \cdot trace (A^{2}),$
because $A_{i i}^{2} = {(A \circ A)}_{i i}$ .
For part (c), define k to be such that $μ_{4} - μ_{2}^{2} = 2 k μ_{2}^{2}$ , namely, k is related to measures of dispersion pertaining to X_i: $\sqrt{2 k}$ is the coefficient of variation, and $(2 k + 1)$ is the kurtosis associated to the random variable X_i. With this, we have
$L (W) = μ_{2}^{2} \cdot trace {(A)}^{2} + 2 μ_{2}^{2} (k \sum_{i = 1}^{d} A_{i i}^{2} + 2 \sum_{i < j} A_{i j}^{2}) .$

From here, the desired conclusion follows because

μ_{2}^{2} \cdot trace {(A)}^{2} + 2 \min \{k, 1\} μ_{2}^{2} (\sum_{i = 1}^{d} A_{i i}^{2} + 2 \sum_{i < j} A_{i j}^{2}) \leq L (W),

and

μ_{2}^{2} \cdot trace {(A)}^{2} + 2 \max \{k, 1\} μ_{2}^{2} (\sum_{i = 1}^{d} A_{i i}^{2} + 2 \sum_{i < j} A_{i j}^{2}) \geq L (W),

together with Equation (7).

4.2. Proof of Theorem 1

Note first that using Theorem 12 part (c), we have
$L (W) \geq \min \{Var (X_{i}^{2}), 2 E {[X_{i}^{2}]}^{2}\} \cdot trace (A^{2}) .$
Now, fix any $W \in R^{m \times d}$ with $rank (W) < d$ . Let $a_{1} \geq \dots \geq a_{d}$ be the eigenvalues of ${(W^{*})}^{T} W^{*}$ , $b_{1} \geq \dots \geq b_{d}$ be the eigenvalues of $- W^{T} W$ , and $λ_{1} \geq \dots \geq λ_{d}$ be the eigenvalues of ${(W^{*})}^{T} W^{*} - W^{T} W$ . Because W is rank deficient, we have $b_{1} = 0$ . Furthermore, $a_{d} = σ_{min} {(W^{*})}^{2}$ because the eigenvalues of ${(W^{*})}^{T} W^{*}$ are precisely the squares of the singular values of $W^{*}$ . Now, recall the (Courant–Fischer) variational characterization of the eigenvalues (Horn and Johnson [47]). If M is a d × d matrix with eigenvalues $c_{1} \geq \dots \geq c_{d}$ , then
$c_{1} = \max_{x : {‖ x ‖}_{2} = 1} x^{T} M x and c_{d} = \min_{x : {‖ x ‖}_{2} = 1} x^{T} M x .$
With this, fix an $x \in R^{d}$ with ${‖ x ‖}_{2} = 1$ . Then,
$x^{T} ({(W^{*})}^{T} W^{*} - W^{T} W) x \geq \min_{x : {‖ x ‖}_{2} = 1} x^{T} {(W^{*})}^{T} W^{*} x + x^{T} (- W^{T} W) x = a_{d} + x^{T} (- W^{T} W) x .$
Because this inequality holds for every x with ${‖ x ‖}_{2} = 1$ , we can take the max over all x and arrive at
$λ_{1} = \max_{x : {‖ x ‖}_{2} = 1} x^{T} ({(W^{*})}^{T} W^{*} - W^{T} W) x \geq a_{d} + b_{1} = a_{d} \geq σ_{min} {(W^{*})}^{2} .$
Now, because $λ_{1}^{2}, \dots, λ_{d}^{2}$ are precisely the eigenvalues of A², we have $trace (A^{2}) = \sum_{i = 1}^{d} λ_{i}^{2} \geq λ_{1}^{2}$ . Hence, for any W with $rank (W) < d$ , it holds that
$L (W) \geq \min \{Var (X_{i}^{2}), 2 E {[X_{i}^{2}]}^{2}\} \cdot λ_{1}^{2} .$
Finally, because $λ_{1}^{2} \geq σ_{min} {(W^{*})}^{4}$ , the desired conclusion follows by taking the minimum over all rank-deficient W.
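The eigenvalue argument above can be illustrated numerically. The following sketch is an assumption-laden example rather than part of the proof: it builds a random full-rank $W^{*}$ and a rank-deficient W (by factoring through a (d − 1)-dimensional bottleneck) and checks that the largest eigenvalue of A is at least $σ_{min} {(W^{*})}^{2}$ , so that $trace (A^{2}) \geq σ_{min} {(W^{*})}^{4}$ .

```python
import numpy as np

rng = np.random.default_rng(2)
m, d = 6, 4  # illustrative sizes with m >= d

W_star = rng.standard_normal((m, d))  # full rank with probability one
# Rank-deficient W: the product of an m x (d-1) and a (d-1) x d matrix has rank at most d-1.
W = rng.standard_normal((m, d - 1)) @ rng.standard_normal((d - 1, d))

A = W_star.T @ W_star - W.T @ W
lam_max = np.linalg.eigvalsh(A)[-1]                      # eigvalsh sorts in ascending order
sigma_min = np.linalg.svd(W_star, compute_uv=False)[-1]  # singular values come sorted descending

print(lam_max, sigma_min**2, np.trace(A @ A), sigma_min**4)
```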
Let the eigenvalues of ${(W^{*})}^{T} W^{*}$ be denoted by $λ_{1}^{*}, \dots, λ_{d}^{*}$ with the corresponding orthogonal eigenvectors $q_{1}^{*}, \dots, q_{d}^{*}$ . Namely, diagonalize ${(W^{*})}^{T} W^{*}$ as $Q^{*} Λ^{*} {(Q^{*})}^{T}$ , where the columns of $Q^{*} \in R^{d \times d}$ are $q_{1}^{*}, \dots, q_{d}^{*}$ and $Λ^{*} \in R^{d \times d}$ is a diagonal matrix with ${(Λ^{*})}_{i, i} = λ_{i}^{*}$ for every $1 \leq i \leq d$ . Let
$\bar{W} = \sum_{j = 1}^{d - 1} \sqrt{λ_{j}^{*}} q_{j}^{*} {(q_{j}^{*})}^{T} \in R^{d \times d} .$

Observe that ${\bar{W}}^{T} \bar{W} = Q^{*} \bar{Λ} {(Q^{*})}^{T}$ , where $\bar{Λ} \in R^{d \times d}$ is a diagonal matrix with ${(\bar{Λ})}_{i, i} = {(Λ^{*})}_{i, i}$ for every $1 \leq i \leq d - 1$ and ${(\bar{Λ})}_{d, d} = 0$ and that $rank (\bar{W}) = d - 1$ . Now, let $\bar{W_{1}}, \dots, \bar{W_{d}} \in R^{d}$ be the rows of $\bar{W}$ and fix a $j \in [d]$ such that $\bar{W_{j}} \neq 0$ .

Having constructed a $\bar{W} \in R^{d \times d}$ , we now prescribe $W \in R^{m \times d}$ as follows. For $1 \leq i \leq d, i \neq j$ , let $W_{i} = \bar{W_{i}}$ , where W_i is the ith row of W. Then, set $W_{j} = \frac{1}{2} \bar{W_{j}}$ , and for every $d + 1 \leq i \leq m$ , set $W_{i} = \frac{\sqrt{3}}{2 \sqrt{m - d}} \bar{W_{j}}$ . For this matrix, we now claim

W^{T} W = {\bar{W}}^{T} \bar{W} .

To see this, fix an $X \in R^{d}$ and recall that $X^{T} W^{T} W X - X^{T} {\bar{W}}^{T} \bar{W} X = {‖ W X ‖}_{2}^{2} - {‖ \bar{W} X ‖}_{2}^{2}$ . We now compute this quantity more explicitly:

\begin{array}{rl} {‖ W X ‖}_{2}^{2} - {‖ \bar{W} X ‖}_{2}^{2} & = \sum_{k = 1}^{m} {〈 W_{k}, X 〉}^{2} - \sum_{k = 1}^{d} {〈 {\bar{W}}_{k}, X 〉}^{2} \\ & = \sum_{k = 1, k \neq j}^{d} {〈 {\bar{W}}_{k}, X 〉}^{2} + {〈 \frac{1}{2} {\bar{W}}_{j}, X 〉}^{2} + \sum_{k = d + 1}^{m} {〈 \frac{\sqrt{3}}{2 \sqrt{m - d}} {\bar{W}}_{j}, X 〉}^{2} \\ & \quad - \sum_{k = 1, k \neq j}^{d} {〈 {\bar{W}}_{k}, X 〉}^{2} - {〈 {\bar{W}}_{j}, X 〉}^{2} \\ & = \frac{1}{4} {〈 {\bar{W}}_{j}, X 〉}^{2} + \frac{3}{4 (m - d)} (m - d) {〈 {\bar{W}}_{j}, X 〉}^{2} - {〈 {\bar{W}}_{j}, X 〉}^{2} = 0 . \end{array}

Hence, for every $X \in R^{d}$ , we have

X^{T} W^{T} W X = X^{T} {\bar{W}}^{T} \bar{W} X .

Now, let $Ξ = W^{T} W - {\bar{W}}^{T} \bar{W}$ . Note that $Ξ \in R^{d \times d}$ is symmetric and $X^{T} Ξ X = 0$ for every $X \in R^{d}$ . Now, taking X to be e_i, that is, the ith element of the standard basis for the Euclidean space $R^{d}$ , we deduce $Ξ_{i, i} = 0$ for every $i \in [d]$ . For the off-diagonal entries, let $X = e_{i} + e_{j}$ . Then, $X^{T} Ξ X = Ξ_{i, i} + Ξ_{i, j} + Ξ_{j, i} + Ξ_{j, j} = 0$ , which, together with the fact that the diagonal entries of Ξ are zero, imply $Ξ_{i, j} = - Ξ_{j, i}$ , namely, Ξ is skew-symmetric. Finally, because Ξ is also symmetric, we have $Ξ_{i, j} = Ξ_{j, i}$ , which then implies for every $i, j \in [d], Ξ_{i, j} = 0$ , that is, Ξ = 0, and thus, $W^{T} W = {\bar{W}}^{T} \bar{W}$ .

Hence, we have for $W \in R^{m \times d}$ with $rank (W) = d - 1$ ,

W^{T} W - {(W^{*})}^{T} W^{*} = Q^{*} Λ' {(Q^{*})}^{T},

with ${(Λ')}_{i, i} = 0$ for every $1 \leq i \leq d - 1$ and ${(Λ')}_{d, d} = - λ_{d}^{*}$ . Namely, the spectrum of the matrix $A = {(W^{*})}^{T} W^{*} - W^{T} W$ contains only two values: zero with multiplicity d − 1 and $λ_{d}^{*}$ with multiplicity one. In particular,

trace (A) = λ_{d}^{*} and trace (A^{2}) = {(λ_{d}^{*})}^{2} .

Using now the upper bound provided by Theorem 12 part (c) yields the desired claim. Therefore, the energy band lower bound is tight up to a multiplicative constant.
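The construction of the rank-deficient W used in this tightness argument can be reproduced numerically. The sketch below is illustrative: it assumes m > d, takes $λ_{d}^{*}$ to be the smallest eigenvalue of ${(W^{*})}^{T} W^{*}$ , and picks the row index j as the row of $\bar{W}$ with the largest norm (any nonzero row would do).

```python
import numpy as np

rng = np.random.default_rng(3)
m, d = 7, 4  # illustrative sizes; the construction needs m > d

W_star = rng.standard_normal((m, d))
G = W_star.T @ W_star
lam, Qs = np.linalg.eigh(G)  # ascending order, so lam[0] plays the role of lambda_d^*

# W_bar keeps every eigen-direction of G except the one with the smallest eigenvalue.
W_bar = sum(np.sqrt(lam[i]) * np.outer(Qs[:, i], Qs[:, i]) for i in range(1, d))

j = int(np.argmax(np.linalg.norm(W_bar, axis=1)))  # a row of W_bar that is nonzero

W = np.zeros((m, d))
W[:d] = W_bar                                              # rows i != j copied from W_bar
W[j] = 0.5 * W_bar[j]                                      # row j is halved
W[d:] = np.sqrt(3.0) / (2.0 * np.sqrt(m - d)) * W_bar[j]   # extra rows share the remaining mass

A = G - W.T @ W
print(np.linalg.eigvalsh(A), lam[0])
```

The printed spectrum of A should consist of d − 1 numerically zero eigenvalues and one eigenvalue equal to the smallest eigenvalue of ${(W^{*})}^{T} W^{*}$ .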

4.3. Proof of Lemma 1

For any fixed $i \in [N], j \in [d]$ , note that using sub-Gaussian property one has $P (| X_{i} (j) | > d^{K_{1}}) \leq \exp (- C d^{2 K_{1}})$ ; thus, $P (\exists i \in [N], j \in [d] : | X_{i} (j) | > d^{K_{1}}) \leq N d \exp (- C d^{2 K_{1}})$ , using union bound, which yields the conclusion.

4.4. Proof of Lemma 2

Let

E_{1} ≜ \{{‖ X_{i} ‖}_{\infty} \leq d^{K_{1}}, 1 \leq i \leq N\} .

By Lemma 1, $P (E_{1}) \geq 1 - N d \exp (- C d^{2 K_{1}})$ . Now, note that

P (E {(A)}^{c}) = P (E {(A)}^{c} | E_{1}) P (E_{1}) + P (E {(A)}^{c} | E_{1}^{c}) P (E_{1}^{c}) \leq P (E {(A)}^{c} | E_{1}) + N d \exp (- C d^{2 K_{1}}) .

(8)

We now study $P (E {(A)}^{c} | E_{1})$ ; hence, assume we condition on $E_{1}$ from now on. The triangle inequality yields

| Y_{i} - X_{i}^{T} A X_{i} | \leq | X_{i}^{T} A X_{i} | + | X_{i}^{T} {(W^{*})}^{T} W^{*} X_{i} | .

Observe now that

{‖ X_{i} X_{i}^{T} ‖}_{F}^{2} = trace (X_{i} X_{i}^{T} X_{i} X_{i}^{T}) = {‖ X_{i} ‖}_{2}^{2} trace (X_{i} X_{i}^{T}) = {‖ X_{i} ‖}_{2}^{4},

which implies (conditional on $E_{1}$ )

{‖ X_{i} X_{i}^{T} ‖}_{F} = {‖ X_{i} ‖}_{2}^{2} \leq d^{2 K_{1} + 1} .

Now, the Cauchy–Schwarz inequality with respect to the inner product $〈 U, V 〉 ≜ trace (U^{T} V)$ yields

| X_{i}^{T} A X_{i} | = 〈 A, X_{i} X_{i}^{T} 〉 \leq {‖ A ‖}_{F} {‖ X_{i} X_{i}^{T} ‖}_{F} \leq d^{2 K_{1} + 2 K_{2} + 1},

for every $i \in [N]$ , using ${‖ A ‖}_{F} \leq d^{2 K_{2}}$ .

Next, let $A^{*} = {(W^{*})}^{T} W^{*} \in R^{d \times d}$ , and let $η_{1}^{*}, \dots, η_{d}^{*}$ be the eigenvalues of $A^{*}$ , all nonnegative. Observe that

{‖ W^{*} ‖}_{F}^{2} = trace (A^{*}) = \sum_{1 \leq j \leq d} η_{j}^{*} \leq d^{2 K_{2}} .

Now, note that ${(η_{1}^{*})}^{2}, {(η_{2}^{*})}^{2}, \dots, {(η_{d}^{*})}^{2}$ are the eigenvalues of ${(A^{*})}^{2} = {(A^{*})}^{T} A^{*}$ . With this reasoning, we have

{‖ A^{*} ‖}_{F}^{2} = trace ({(A^{*})}^{T} A^{*}) = trace ({(A^{*})}^{2}) = \sum_{1 \leq j \leq d} {(η_{j}^{*})}^{2} \leq {(\sum_{1 \leq j \leq d} η_{j}^{*})}^{2} \leq d^{4 K_{2}} .

Consequently, ${‖ A^{*} ‖}_{F} \leq d^{2 K_{2}}$ , and therefore, the exact same reasoning yields

| X_{i}^{T} {(W^{*})}^{T} W^{*} X_{i} | = X_{i}^{T} A^{*} X_{i} \leq d^{2 K_{1} + 2 K_{2} + 1},

for every $i \in [N]$ . Hence, conditional on $E_{1}$ , it holds that, for every $i \in [N]$ ,


{(X_{i}^{T} A X_{i} - X_{i}^{T} {(W^{*})}^{T} W^{*} X_{i})}^{2} \leq 4 d^{4 K_{1} + 4 K_{2} + 2} .

We now apply concentration to the i.i.d. sum

\frac{1}{N} \sum_{1 \leq i \leq N} {(X_{i}^{T} A X_{i} - X_{i}^{T} {(W^{*})}^{T} W^{*} X_{i})}^{2},

which is a sum of bounded random variables that are at most $4 d^{4 K_{1} + 4 K_{2} + 2}$ .


Now, recalling the distributional assumption on the data, we have that, conditional on ${‖ X_{i} ‖}_{\infty} \leq d^{K_{1}}$ , the data still has i.i.d. centered coordinates. In particular, the energy barrier result for the population risk as per Theorem 1 applies:

E [{(X^{T} A X - X^{T} {(W^{*})}^{T} W^{*} X)}^{2} | E_{1}] \geq C_{5} (K_{1}) σ_{min} {(W^{*})}^{4},

where $C_{5} (K_{1}) = \min \{μ_{4} (K_{1}) - μ_{2} {(K_{1})}^{2}, 2 μ_{2} {(K_{1})}^{2}\}$ is controlled by the conditional moments of the data coordinates.

Finally applying Hoeffding’s inequality for bounded random variables, we arrive at

\frac{1}{N} \sum_{1 \leq i \leq N} {(X_{i}^{T} A X_{i} - X_{i}^{T} {(W^{*})}^{T} W^{*} X_{i})}^{2} \geq \frac{1}{2} C_{5} (K_{1}) σ_{min} {(W^{*})}^{4},

with probability at least $1 - \exp (- C_{3} N d^{- 4 K_{1} - 4 K_{2} - 2})$ . Namely,

P (E {(A)}^{c} | E_{1}) \leq \exp (- C_{3} N d^{- 4 K_{1} - 4 K_{2} - 2}) .

Returning to (8), this yields

P (E (A)) \geq 1 - \exp (- C_{3} N d^{- 4 K_{1} - 4 K_{2} - 2}) - N d \exp (- C d^{2 K_{1}}) .

This completes the proof of Lemma 2.

4.5. Proof of Lemma 3

The proof is almost verbatim from Candès and Plan [14, lemma 3.1], and included herein for completeness.

Note that any $A \in R^{d \times d}$ with $A ⪰ 0$ and $rank (A) = r$ decomposes as $A = Q Λ Q^{T}$ , where $Q \in R^{d \times r}$ satisfies $Q^{T} Q = I_{r}$ and $Λ \in R^{r \times r}$ is a diagonal matrix with nonnegative diagonal entries. Notice, furthermore, that ${‖ A ‖}_{F} = {‖ Λ ‖}_{F} \leq R$ as Q has orthonormal columns. With this, we now construct an appropriate net covering the set of all permissible Q and Λ.

Let D be the set of all r × r diagonal matrices with nonnegative diagonal entries with a Frobenius norm at most R. Let $\bar{D}$ be an $\frac{ϵ}{3}$-net for D in the Frobenius norm. Using standard results (see, e.g., Vershynin [91, lemma 5.2]), we have

| \bar{D} | \leq {(\frac{9 R}{ϵ})}^{r} .

Now, let $O_{d, r} = \{Q \in R^{d \times r} : Q^{T} Q = I_{r}\}$ . To cover $O_{d, r}$ , we use a more convenient norm ${‖ \cdot ‖}_{1, 2}$ defined as

{‖ X ‖}_{1, 2} = \max_{i} {‖ X_{i} ‖}_{2},

where X_i is the ith column of X. Define $Q_{d, r} = \{X \in R^{d \times r} : {‖ X ‖}_{1, 2} \leq 1\}$ . Note that $O_{d, r} \subset Q_{d, r}$ . Furthermore, observe also that $Q_{d, r}$ has an $ϵ$-net of cardinality at most ${(3 / ϵ)}^{d r}$ . With this, we now take ${\bar{O}}_{d, r}$ to be an $\frac{ϵ}{3 R}$-net for $O_{d, r}$ . Consider now the set

\bar{S_{R}} ≜ \{\bar{Q} \bar{Λ} {\bar{Q}}^{T} : \bar{Q} \in {\bar{O}}_{d, r}, \bar{Λ} \in \bar{D}\} .

Clearly,

| \bar{S_{R}} | \leq | {\bar{O}}_{d, r} | | \bar{D} | \leq {(9 R / ϵ)}^{d r + r} .

We now claim $\bar{S_{R}}$ is indeed an $ϵ$-net for S_R in the Frobenius norm. To prove this, take an arbitrary $A \in S_{R}$ , and let $A = Q Λ Q^{T}$ . There exist a $\bar{Q} \in {\bar{O}}_{d, r}$ and a $\bar{Λ} \in \bar{D}$ such that ${‖ Λ - \bar{Λ} ‖}_{F} \leq ϵ / 3$ and ${‖ Q - \bar{Q} ‖}_{1, 2} \leq ϵ / (3 R)$ . Now, let $\bar{A} = \bar{Q} \bar{Λ} {\bar{Q}}^{T}$ . Observe that, using the triangle inequality,

\begin{array}{rl} {‖ \bar{A} - A ‖}_{F} & = {‖ Q Λ Q^{T} - \bar{Q} \bar{Λ} {\bar{Q}}^{T} ‖}_{F} \\ & \leq {‖ Q Λ Q^{T} - \bar{Q} Λ Q^{T} ‖}_{F} + {‖ \bar{Q} Λ Q^{T} - \bar{Q} \bar{Λ} Q^{T} ‖}_{F} + {‖ \bar{Q} \bar{Λ} Q^{T} - \bar{Q} \bar{Λ} {\bar{Q}}^{T} ‖}_{F} . \end{array}

For the first term, note that, because Q is orthonormal, ${‖ (Q - \bar{Q}) Λ Q^{T} ‖}_{F} = {‖ (Q - \bar{Q}) Λ ‖}_{F}$ . Next,

{‖ (Q - \bar{Q}) Λ ‖}_{F}^{2} = \sum_{1 \leq i \leq r} Λ_{i i}^{2} {‖ Q_{i} - {\bar{Q}}_{i} ‖}_{2}^{2} \leq {‖ Q - \bar{Q} ‖}_{1, 2}^{2} {‖ Λ ‖}_{F}^{2} \leq {(ϵ / 3)}^{2},

using ${‖ Q - \bar{Q} ‖}_{1, 2} \leq ϵ / (3 R)$ and ${‖ Λ ‖}_{F} \leq R$ . Thus, ${‖ Q Λ Q^{T} - \bar{Q} Λ Q^{T} ‖}_{F} \leq ϵ / 3$ . Similarly, we also have ${‖ \bar{Q} \bar{Λ} Q^{T} - \bar{Q} \bar{Λ} {\bar{Q}}^{T} ‖}_{F} \leq ϵ / 3$ . Finally,

{‖ \bar{Q} Λ Q^{T} - \bar{Q} \bar{Λ} Q^{T} ‖}_{F} = {‖ Λ Q^{T} - \bar{Λ} Q^{T} ‖}_{F} = {‖ Λ - \bar{Λ} ‖}_{F} \leq ϵ / 3,

using again the facts that Q and $\bar{Q}$ are both orthonormal. This concludes that ${‖ \bar{A} - A ‖}_{F} \leq ϵ$ ; thus, $\bar{S_{R}}$ is indeed an $ϵ$-net for S_R, in the Frobenius norm, of cardinality at most ${(9 R / ϵ)}^{d r + r}$ .

As a side remark, observe that we gain an extra factor of two in the exponent owing to the fact that A is positive semidefinite (otherwise, the bound would be ${(9 R / ϵ)}^{2 d r + r}$ ).
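The column-wise norm bound used for the first term of the triangle inequality can be checked directly. The sketch below is an illustration under assumed values of d, r, R, and ϵ; the matrix `Q_bar` is a hypothetical perturbation of Q whose columns each move by exactly $ϵ / (3 R)$ (it is not taken from an actual net and need not be orthonormal). The last line evaluates the logarithm of the cardinality bound of Lemma 3.

```python
import numpy as np

def norm_12(X):
    """Maximum Euclidean norm over the columns of X."""
    return np.max(np.linalg.norm(X, axis=0))

rng = np.random.default_rng(4)
d, r, R, eps = 6, 3, 2.0, 0.1  # illustrative values

Q, _ = np.linalg.qr(rng.standard_normal((d, r)))  # Q has orthonormal columns
lam = rng.random(r)
lam *= R / np.linalg.norm(lam)                    # rescale so that ||Lambda||_F = R
Lam = np.diag(lam)

# Hypothetical perturbation: each column of Q moves by exactly eps / (3 R).
P = rng.standard_normal((d, r))
P /= np.linalg.norm(P, axis=0)
Q_bar = Q + (eps / (3.0 * R)) * P

lhs = np.linalg.norm((Q - Q_bar) @ Lam, 'fro')
log_cardinality_bound = (d * r + r) * np.log(9.0 * R / eps)
print(lhs, eps / 3.0, log_cardinality_bound)
```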

4.6. Proof of Theorem 3

We first establish the following proposition for any W, which is a stationary point of the population risk.

Proposition 1.

Let $D^{*} \in R^{d \times d}$ be a diagonal matrix with $D_{i i}^{*} = {({(W^{*})}^{T} W^{*})}_{i i}$ and define $D \in R^{d \times d}$ analogously. Then, any stationary point $W \in R^{m \times d}$ of the population risk satisfies the stationarity equation

\begin{array}{l} (μ_{4} - 3 μ_{2}^{2}) W D^{*} + μ_{2}^{2} W {‖ W^{*} ‖}_{F}^{2} + 2 μ_{2}^{2} (W {(W^{*})}^{T} W^{*}) \\ = (μ_{4} - 3 μ_{2}^{2}) W D + μ_{2}^{2} W {‖ W ‖}_{F}^{2} + 2 μ_{2}^{2} (W (W^{T} W)) . \end{array}

To prove Proposition 1, fix a $k_{0} \in [m]$ and $ℓ_{0} \in [d]$ . Note that $\nabla_{k_{0}, ℓ_{0}} L (W) = E [\nabla_{k_{0}, ℓ_{0}} {(f (W^{*}; X) - f (W; X))}^{2}]$ , using the dominated convergence theorem. Next, $E [\nabla_{k_{0}, ℓ_{0}} {(f (W^{*}; X) - f (W; X))}^{2}] = 0$ implies that, for every $k_{0} \in [m]$ and $ℓ_{0} \in [d]$ ,

\sum_{j = 1}^{m} E [{〈 W_{j}^{*}, X 〉}^{2} 〈 W_{k_{0}}, X 〉 X_{ℓ_{0}}] = \sum_{j = 1}^{m} E [{〈 W_{j}, X 〉}^{2} 〈 W_{k_{0}}, X 〉 X_{ℓ_{0}}] .

Note next that $\sum_{j = 1}^{m} E [{〈 W_{j}^{*}, X 〉}^{2} 〈 W_{k_{0}}, X 〉 X_{ℓ_{0}}]$ computes as

μ_{4} \sum_{j = 1}^{m} {(W_{j, ℓ_{0}}^{*})}^{2} W_{k_{0}, ℓ_{0}} + μ_{2}^{2} \sum_{j = 1}^{m} \sum_{1 \leq ℓ \leq d, ℓ \neq ℓ_{0}} W_{k_{0}, ℓ_{0}} {(W_{j, ℓ}^{*})}^{2} + 2 μ_{2}^{2} \sum_{j = 1}^{m} \sum_{1 \leq ℓ \leq d, ℓ \neq ℓ_{0}} W_{k_{0}, ℓ} W_{j, ℓ}^{*} W_{j, ℓ_{0}}^{*} .

We now put this object into a more convenient form. Notice that the preceding expression is

(μ_{4} - 3 μ_{2}^{2}) A_{k_{0}, ℓ_{0}} + μ_{2}^{2} B_{k_{0}, ℓ_{0}} + 2 μ_{2}^{2} C_{k_{0}, ℓ_{0}},

where

A_{k_{0}, ℓ_{0}} = W_{k_{0}, ℓ_{0}} \sum_{j = 1}^{m} {(W_{j, ℓ_{0}}^{*})}^{2} and B_{k_{0}, ℓ_{0}} = \sum_{j = 1}^{m} \sum_{ℓ = 1}^{d} W_{k_{0}, ℓ_{0}} {(W_{j, ℓ}^{*})}^{2} and C_{k_{0}, ℓ_{0}} = \sum_{j = 1}^{m} \sum_{ℓ = 1}^{d} W_{k_{0}, ℓ} W_{j, ℓ_{0}}^{*} W_{j, ℓ}^{*} .

Observe that $B_{k_{0}, ℓ_{0}} = W_{k_{0}, ℓ_{0}} {‖ W^{*} ‖}_{F}^{2}$ . We now study $A_{k_{0}, ℓ_{0}}$ and $C_{k_{0}, ℓ_{0}}$ more carefully. Observe that $\sum_{j = 1}^{m} {(W_{j, ℓ_{0}}^{*})}^{2} = {({(W^{*})}^{T} W^{*})}_{ℓ_{0}, ℓ_{0}}$ . Now, let $D^{*} \in R^{d \times d}$ be a diagonal matrix in which ${(D^{*})}_{i j} = {({(W^{*})}^{T} W^{*})}_{i i}$ if i = j and zero otherwise. We then have $A_{k_{0}, ℓ_{0}} = {(W D^{*})}_{k_{0}, ℓ_{0}}$ .

We now study $C_{k_{0}, ℓ_{0}}$ . Recall that $W_{i}^{*}$ is the ith row of $W^{*}$ . Observe that $\sum_{j = 1}^{m} W_{j, ℓ_{0}}^{*} W_{j, ℓ}^{*} = {({(W^{*})}^{T} W^{*})}_{ℓ_{0}, ℓ}$ . Hence,

\sum_{j = 1}^{m} \sum_{ℓ = 1}^{d} W_{k_{0}, ℓ} W_{j, ℓ_{0}}^{*} W_{j, ℓ}^{*} = \sum_{ℓ = 1}^{d} \sum_{j = 1}^{m} W_{k_{0}, ℓ} W_{j, ℓ_{0}}^{*} W_{j, ℓ}^{*} = \sum_{ℓ = 1}^{d} W_{k_{0}, ℓ} {({(W^{*})}^{T} W^{*})}_{ℓ_{0}, ℓ} = {(W ({(W^{*})}^{T} W^{*}))}_{k_{0}, ℓ_{0}},

that is, $C_{k_{0}, ℓ_{0}} = {(W ({(W^{*})}^{T} W^{*}))}_{k_{0}, ℓ_{0}}$ . Combining everything, we have that, for every $k_{0} \in [m]$ and $ℓ_{0} \in [d]$ ,


\sum_{j = 1}^{m} E [{〈 W_{j}^{*}, X 〉}^{2} 〈 W_{k_{0}}, X 〉 X_{ℓ_{0}}] = (μ_{4} - 3 μ_{2}^{2}) {(W D^{*})}_{k_{0}, ℓ_{0}} + μ_{2}^{2} W_{k_{0}, ℓ_{0}} {‖ W^{*} ‖}_{F}^{2} + 2 μ_{2}^{2} {(W ({(W^{*})}^{T} W^{*}))}_{k_{0}, ℓ_{0}} .

In particular, stationarity yields

(μ_{4} - 3 μ_{2}^{2}) W D^{*} + μ_{2}^{2} W {‖ W^{*} ‖}_{F}^{2} + 2 μ_{2}^{2} (W ({(W^{*})}^{T} W^{*})) = (μ_{4} - 3 μ_{2}^{2}) W D + μ_{2}^{2} W {‖ W ‖}_{F}^{2} + 2 μ_{2}^{2} W (W^{T} W),

(9)

where the d × d diagonal matrix D is defined as $D_{i i} = {(W^{T} W)}_{i i}$ , and entry-wise equalities are converted into equality of two matrices by varying $k_{0} \in [m]$ and $ℓ_{0} \in [d]$ . □

Having now established Proposition 1 for the stationarity equation, we now study its implications for any full-rank W.

Let $W \in R^{m \times d}$ be a stationary point with $rank (W) = d$ . We first establish ${‖ W ‖}_{F} = {‖ W^{*} ‖}_{F}$ . Because $W \in R^{m \times d}$ is a stationary point, it holds that, for every $(k_{0}, ℓ_{0}) \in [m] \times [d], \nabla_{k_{0}, ℓ_{0}} L (W) = 0$ . In particular, Equation (9) holds.

Recalling now that W is full rank, it follows from the rank-nullity theorem that $ker (W)$ is trivial, that is, $ker (W) = {0}$ . Hence, for matrices M₁, M₂ (with matching dimensions), whenever WM₁ = WM₂ holds, we deduce M₁ = M₂ because each column of $M_{1} - M_{2}$ is contained in $ker (W)$ . Thus, Equation (9) then yields

(μ_{4} - 3 μ_{2}^{2}) D^{*} + μ_{2}^{2} {‖ W^{*} ‖}_{F}^{2} I_{d} + 2 μ_{2}^{2} {(W^{*})}^{T} W^{*} = (μ_{4} - 3 μ_{2}^{2}) D + μ_{2}^{2} {‖ W ‖}_{F}^{2} I_{d} + 2 μ_{2}^{2} W^{T} W .

(10)

Next, note that $trace (D^{*}) = \sum_{i = 1}^{d} ((W^{*})^{T} W^{*})_{i i} = trace ({(W^{*})}^{T} W^{*}) = {‖ W^{*} ‖}_{F}^{2}$ , and similarly, $trace (D) = {‖ W ‖}_{F}^{2}$ . In particular, taking traces of both sides in Equation (10), we get

(μ_{4} - μ_{2}^{2}) {‖ W^{*} ‖}_{F}^{2} + μ_{2}^{2} d {‖ W^{*} ‖}_{F}^{2} = (μ_{4} - μ_{2}^{2}) {‖ W ‖}_{F}^{2} + μ_{2}^{2} d {‖ W ‖}_{F}^{2},

implying that ${‖ W^{*} ‖}_{F}^{2} = {‖ W ‖}_{F}^{2}$ . Incorporating this into Equation (10), we then arrive at

(μ_{4} - 3 μ_{2}^{2}) D^{*} + 2 μ_{2}^{2} {(W^{*})}^{T} W^{*} = (μ_{4} - 3 μ_{2}^{2}) D + 2 μ_{2}^{2} W^{T} W .

Now, suppose $i \in [d]$ . Note that inspecting the preceding (i, i) coordinate, we get

(μ_{4} - 3 μ_{2}^{2}) {({(W^{*})}^{T} W^{*})}_{i i} + 2 μ_{2}^{2} {({(W^{*})}^{T} W^{*})}_{i i} = (μ_{4} - 3 μ_{2}^{2}) {(W^{T} W)}_{i i} + 2 μ_{2}^{2} {(W^{T} W)}_{i i} .

Because $μ_{4} - μ_{2}^{2} = Var (X_{i}^{2}) > 0$ , we then get

{({(W^{*})}^{T} W^{*})}_{i i} = {(W^{T} W)}_{i i} .

Now, focus on off-diagonal entries by fixing $i \neq j$ . Observe that, because $Var (X_{i}^{2}) > 0$ , it also holds that $E [X_{i}^{2}] = μ_{2} > 0$ . Now, note that, $D_{i j}^{*} = D_{i j} = 0$ in this case. We then have

2 μ_{2}^{2} {({(W^{*})}^{T} W^{*})}_{i j} = 2 μ_{2}^{2} {(W^{T} W)}_{i j} \Rightarrow {(W^{*})}^{T} W^{*} = W^{T} W .

We conclude that the matrix ${(W^{*})}^{T} W^{*} - W^{T} W$ is a zero matrix. Hence, $W = Q W^{*}$ for some orthonormal $Q \in R^{m \times m}$ per Theorem 14, and $L (W) = 0$ .
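The stationarity equation (9) corresponds to setting the gradient of the closed-form risk of Theorem 12(b) to zero. The sketch below is an illustrative check, not part of the proof: differentiating the closed form gives $\nabla L (W) = - 4 W M$ , where M is the left-hand side of Equation (10) minus its right-hand side. The code compares this expression with central finite differences, assuming moments $μ_{2} = 1$ and $μ_{4} = 9 / 5$ (those of a uniform variable on $[- \sqrt{3}, \sqrt{3}]$ ). The function names are hypothetical.

```python
import numpy as np

def risk(W, W_star, mu2, mu4):
    """Population risk via the closed form of Theorem 12(b)."""
    A = W_star.T @ W_star - W.T @ W
    return (mu2**2 * np.trace(A)**2 + 2 * mu2**2 * np.trace(A @ A)
            + (mu4 - 3 * mu2**2) * np.sum(np.diag(A)**2))

def risk_grad(W, W_star, mu2, mu4):
    """Gradient -4 W M, with M the difference of the two sides of Equation (10)."""
    A = W_star.T @ W_star - W.T @ W
    M = ((mu4 - 3 * mu2**2) * np.diag(np.diag(A))
         + mu2**2 * np.trace(A) * np.eye(A.shape[0])
         + 2 * mu2**2 * A)
    return -4.0 * W @ M

def numerical_grad(W, W_star, mu2, mu4, h=1e-5):
    """Entry-wise central finite differences of the risk."""
    G = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        E = np.zeros_like(W)
        E[idx] = h
        G[idx] = (risk(W + E, W_star, mu2, mu4) - risk(W - E, W_star, mu2, mu4)) / (2 * h)
    return G

rng = np.random.default_rng(5)
m, d = 5, 3
mu2, mu4 = 1.0, 1.8  # assumed moments: uniform on [-sqrt(3), sqrt(3)]
W_star = rng.standard_normal((m, d))
W = rng.standard_normal((m, d))

g_formula = risk_grad(W, W_star, mu2, mu4)
g_numeric = numerical_grad(W, W_star, mu2, mu4)

# Points on the optimal orbit W = Q W* are stationary.
Q, _ = np.linalg.qr(rng.standard_normal((m, m)))
g_orbit = risk_grad(Q @ W_star, W_star, mu2, mu4)
print(np.max(np.abs(g_formula - g_numeric)), np.max(np.abs(g_orbit)))
```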

4.7. Proof of Theorem 5

Let $\{W_{t}\}_{t \geq 0}$ be a sequence of m × d matrices corresponding to the weights along the trajectory of gradient descent, that is, $W_{t} \in R^{m \times d}$ is the weight matrix at iteration t of the algorithm. We first show $L < \infty$ . To see this, recall Theorem 12(c): $L (W) \geq μ_{2}^{2} \cdot trace {(A)}^{2}$ , where $trace (A) = {‖ W^{*} ‖}_{F}^{2} - {‖ W ‖}_{F}^{2}$ . In particular, this yields $μ_{2}^{2} {({‖ W ‖}_{F}^{2} - {‖ W^{*} ‖}_{F}^{2})}^{2} \leq L (W)$ . Hence, for any W with $L (W) \leq L (W_{0})$ , it holds that

{‖ W ‖}_{F} \leq {(\frac{\sqrt{L (W_{0})}}{μ_{2}} + {‖ W^{*} ‖}_{F}^{2})}^{1 / 2} < \infty .

Namely, the (Frobenius) norm of the weights of any W with $L (W) \leq L (W_{0})$ remains uniformly bounded from above. This, in turn, yields that the (spectral norm of the) Hessian of the objective function remains uniformly bounded from above for any such W because the objective is a polynomial function of W; this uniform bound is precisely what we denote by L.

We now run gradient descent with a step size of $η < 1 / (2 L)$ : a second-order Taylor expansion reveals that

L (W_{1}) - L (W_{0}) \leq - η {‖ \nabla L (W_{0}) ‖}_{2}^{2} / 2,

where $\nabla L (W)$ is the gradient of the population risk evaluated at W.

In particular, $L (W_{1}) \leq L (W_{0})$ , and furthermore, $‖ \nabla^{2} L (W_{1}) ‖ \leq L$ , where $‖ \nabla^{2} L (W) ‖$ is the spectral norm of the Hessian matrix $\nabla^{2} L (W)$ . From here, we induct on k: the induction argument reveals that we can retain a step size of $η < 1 / (2 L)$ , and the gradient descent trajectory ${W_{k}}_{k \geq 0}$ is such that (i) $L (W_{k}) \geq L (W_{k + 1})$ for every $k \geq 0$ and (ii) for every $k \geq 0$ ,

L (W_{k + 1}) - L (W_{k}) \leq - η {‖ \nabla L (W_{k}) ‖}_{2}^{2} / 2 .

We now establish that ${‖ \nabla L (W_{k}) ‖}_{2} \to 0$ as $k \to \infty$ . Note that the objective function is lower bounded (by zero). If the gradient norm did not converge to zero, then along some subsequence each step would reduce the value of the objective function by an amount that is (uniformly) bounded away from zero. But this contradicts the fact that the objective is lower bounded. Thus, we deduce

\lim_{k \to \infty} {‖ \nabla L (W_{k}) ‖}_{2} = 0 .

Now, recall that the trajectory is such that $L (W_{k}) \geq L (W_{k + 1})$ and ${‖ \nabla L (W_{k}) ‖}_{2} \to 0$ as $k \to \infty$ . Suppose that the initial value, $L (W_{0})$ , is such that

L (W_{0}) < \min_{W \in R^{m \times d} : rank (W) < d} L (W) .

In particular, for every $k \in Z^{+}$ ,

L (W_{k}) \leq L (W_{0}) < \min_{W \in R^{m \times d} : rank (W) < d} L (W) .

(11)

Therefore, $W_{k} \in R^{m \times d}$ is full rank for all k, per Theorem 1. We now establish

\lim_{k \to \infty} L (W_{k}) = 0 .

To see this, observe that the sequence ${L (W_{k})}_{k \geq 0}$ is monotonic (nonincreasing) and, furthermore, is bounded by zero from below. Hence,

\lim_{k \to \infty} L (W_{k}) ≜ ℓ

exists (Rudin [76, theorem 3.14]). We now show

ℓ = 0 .

Because the weights remain bounded along the trajectory, there exists a convergent subsequence ${W_{k_{n}}}_{n \in N}$ , that is, $W_{k_{n}} \to W^{\infty}$ as $n \to \infty$ , where $W^{\infty} \in R^{m \times d}$ . Now, the continuity of $\nabla L$ , together with the continuity of the norm ${‖ \cdot ‖}_{2}$ , implies that ${‖ \nabla L (W^{\infty}) ‖}_{2} = 0$ . Furthermore, continuity of $L (\cdot)$ implies $L (W^{\infty}) = ℓ$ . Now, because $L (W_{k_{n}}) \leq L (W_{0})$ for all $n \in N$ and $L (W_{0})$ is strictly smaller than the rank-deficient energy barrier, taking limits as $n \to \infty$ and using (11) gives $L (W^{\infty}) \leq L (W_{0}) < \min_{W : rank (W) < d} L (W)$ , and we conclude that $W^{\infty}$ is full rank. Because $W^{\infty}$ is also a stationary point of the loss, by Theorem 3, we deduce $L (W^{\infty}) = 0$ , which yields $ℓ = 0$ as desired.
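To illustrate the argument, the following is a minimal numerical sketch of the scalar case $m = d = 1$ with standard Gaussian input. This setting is an assumption made here purely for illustration. In this case the population risk reduces to $L (w) = μ_{4} {(w^{2} - {(w^{*})}^{2})}^{2}$ with $μ_{4} = 3$ , the only rank-deficient point is $w = 0$ , and the names `risk`, `grad`, `hessian`, and `gradient_descent` are hypothetical helpers.

```python
# Toy scalar instance (m = d = 1), standard Gaussian input assumed: mu_4 = E[X^4] = 3.
# Population risk: L(w) = E[(w^2 X^2 - w_star^2 X^2)^2] = mu_4 * (w^2 - w_star^2)^2.
MU4 = 3.0
W_STAR = 1.0


def risk(w):
    return MU4 * (w * w - W_STAR ** 2) ** 2


def grad(w):
    return 4.0 * MU4 * w * (w * w - W_STAR ** 2)


def hessian(w):
    return 4.0 * MU4 * (3.0 * w * w - W_STAR ** 2)


def gradient_descent(w0, steps=200):
    # The only rank-deficient point is w = 0, so the energy barrier is risk(0).
    assert risk(w0) < risk(0.0), "initialization must lie below the barrier"
    # On the sublevel set {risk <= risk(w0)}: w^2 <= w_star^2 + sqrt(risk(w0) / mu_4).
    # Below the barrier, |hessian| is largest at the largest |w| in that set;
    # this value plays the role of the uniform Hessian bound L in the proof.
    w_max = (W_STAR ** 2 + (risk(w0) / MU4) ** 0.5) ** 0.5
    hess_bound = abs(hessian(w_max))
    eta = 0.49 / hess_bound  # any step size below 1 / (2 L)
    ws = [w0]
    for _ in range(steps):
        ws.append(ws[-1] - eta * grad(ws[-1]))
    return ws


trajectory = gradient_descent(w0=1.3)
risks = [risk(w) for w in trajectory]
```

Initialized below the barrier $L (0) = μ_{4} {(w^{*})}^{4}$ , the iterates in this toy instance never reach $w = 0$ , the risk is nonincreasing along the trajectory, and it tends to zero, in line with the conclusion of Theorem 5.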

4.8. Proof of Theorem 7

4.8.1. Part (a).

Let $t = \sqrt{d}$ . Then, using Theorem 15, it holds that, with probability at least $1 - 2 \exp (- d / 2)$ ,

\begin{array}{l} \sqrt{m} - 2 \sqrt{d} & \leq σ_{min} (W^{*}) \leq σ_{max} (W^{*}) \leq \sqrt{m} + 2 \sqrt{d} \\ \Rightarrow m + 4 d - 4 \sqrt{m d} & \leq λ_{min} ({(W^{*})}^{T} W^{*}) \leq λ_{max} ({(W^{*})}^{T} W^{*}) \leq m + 4 d + 4 \sqrt{m d} . \end{array}

Recall that $σ (A)$ denotes the spectrum of A, that is, $σ (A) = {λ : λ is an eigenvalue of A}$ . We then claim that the spectrum of $γ I - A$ is $γ - σ (A)$ . To see this, simply note the following line of reasoning:

γ - λ \in σ (γ I - A) \Leftrightarrow det ((γ - λ) I - (γ I - A)) = 0 \Leftrightarrow det (λ I - A) = 0 \Leftrightarrow λ \in σ (A) .

Now, let $W_{0} \in R^{m \times d}$ be such that $W_{0}^{T} W_{0} = γ I$ with $γ = m + 4 d$ . In particular, if $λ_{1} \leq \dots \leq λ_{d}$ are the eigenvalues of $γ I - {(W^{*})}^{T} W^{*}$ , then it holds that

- 4 \sqrt{m d} \leq λ_{1} \leq \dots \leq λ_{d} \leq 4 \sqrt{m d} .

Now, recall by Theorem 12(c) that

L (W_{0}) \leq μ_{2}^{2} {(\sum_{i = 1}^{d} λ_{i})}^{2} + \max {Var (X_{i}^{2}), 2 E {[X_{i}^{2}]}^{2}} (\sum_{i = 1}^{d} λ_{i}^{2}),

where $σ (W_{0}^{T} W_{0} - {(W^{*})}^{T} W^{*}) = {λ_{1}, \dots, λ_{d}}$ . For the second term, we immediately have

\sum_{i = 1}^{d} λ_{i}^{2} \leq 16 m d^{2} .

For the first term, note first that, if $λ_{1}^{'} \leq \dots \leq λ_{d}^{'}$ are the eigenvalues of ${(W^{*})}^{T} W^{*}$ , then

\sum_{k = 1}^{d} λ_{k}^{'} = trace ({(W^{*})}^{T} W^{*}) = \sum_{i = 1}^{m} \sum_{j = 1}^{d} {(W_{i j}^{*})}^{2} \Rightarrow \sum_{k = 1}^{d} (λ_{k}^{'} - m) = \sum_{i = 1}^{m} \sum_{j = 1}^{d} ((W_{i j}^{*})^{2} - 1),

where the entries $W_{i j}^{*} \overset{d}{=} N (0, 1)$ are i.i.d. Note also that ${(W_{i j}^{*})}^{2} - 1$ is a centered random variable with a subexponential tail; see Vershynin [91, lemma 5.14]. Now, letting $Z_{i j} = {(W_{i j}^{*})}^{2} - 1$ and applying the Bernstein-type inequality (Vershynin [91, proposition 5.16]), we have that, for some absolute constants $K, c > 0$ , it holds

P (| \sum_{i = 1}^{m} \sum_{j = 1}^{d} Z_{i j} | > d \sqrt{m}) \leq 2 \exp (- c \min (\frac{d}{K^{2}}, \frac{d \sqrt{m}}{K})) \leq 2 \exp (- c d / K^{2}) = \exp (- Ω (d)),

for m sufficiently large. In particular, with probability at least $1 - \exp (- Ω (d))$ , it, therefore, holds that

| \sum_{k = 1}^{d} (λ_{k}^{'} - m) | \leq d \sqrt{m} .

Finally, using the triangle inequality,

| \sum_{k = 1}^{d} λ_{k} | = | \sum_{k = 1}^{d} (λ_{k}^{'} - (m + 4 d)) | \leq | \sum_{k = 1}^{d} (λ_{k}^{'} - m) | + 4 d^{2} \leq d \sqrt{m} + 4 d^{2},

with probability $1 - \exp (- Ω (d))$ . After squaring, we obtain that

{(\sum_{i = 1}^{d} λ_{i})}^{2} \leq 16 d^{4} + 8 d^{3} \sqrt{m} + d^{2} m .

In particular, we get

\begin{array}{rl} L (W_{0}) & \leq μ_{2}^{2} {(\sum_{i = 1}^{d} λ_{i})}^{2} + \max {Var (X_{i}^{2}), 2 E {[X_{i}^{2}]}^{2}} (\sum_{i = 1}^{d} λ_{i}^{2}) \\ & \leq μ_{2}^{2} (16 d^{4} + 8 d^{3} \sqrt{m} + m d^{2}) + \max {Var (X_{i}^{2}), 2 E {[X_{i}^{2}]}^{2}} 16 m d^{2} . \end{array}

Using now the overparameterization $m > C d^{2}$ , we further have

E {[X_{i}^{2}]}^{2} (16 d^{4} + 8 d^{3} \sqrt{m} + m d^{2}) + \max {Var (X_{i}^{2}), 2 E {[X_{i}^{2}]}^{2}} 16 m d^{2} \leq C' (C) m^{2},

where

C' (C) = E {[X_{i}^{2}]}^{2} (\frac{16}{C^{2}} + \frac{8}{C^{3 / 2}} + \frac{1}{C}) + \frac{16}{C} \max {Var (X_{i}^{2}), 2 E {[X_{i}^{2}]}^{2}} .

Note that, for the constant $C' (C)$ ,

C' (C) \to 0 as C \to \infty .

Next, observe that, $\sqrt{m} - 2 \sqrt{d} \geq \frac{1}{2} \sqrt{m}$ for m large (in the regime $m > C d^{2}$ with C large enough). Thus, using what we establish in Theorem 1, we arrive at

\begin{array}{l} \min_{W \in R^{m \times d} : rank (W) < d} L (W) & > \min {Var (X_{i}^{2}), 2 E {[X_{i}^{2}]}^{2}} σ_{min} {(W^{*})}^{4} \\ \geq \min {Var (X_{i}^{2}), 2 E {[X_{i}^{2}]}^{2}} {(\sqrt{m} - 2 \sqrt{d})}^{4} \\ \geq \frac{1}{16} \min {Var (X_{i}^{2}), 2 E {[X_{i}^{2}]}^{2}} m^{2} . \end{array}

Finally, observe also that, if $Var (X_{i}^{2}) > 0$ , then $E [X_{i}^{2}] > 0$ as well: indeed observe that, if $E [X_{i}^{2}] = 0$ , then X_i = 0 almost surely, for which $Var (X_{i}^{2}) = 0$ . In particular, $\min {Var (X_{i}^{2}), 2 E {[X_{i}^{2}]}^{2}} > 0$ . Equipped with this, we then observe that provided

\frac{1}{16} \min {Var (X_{i}^{2}), 2 E {[X_{i}^{2}]}^{2}} > C' (C) = E {[X_{i}^{2}]}^{2} (\frac{16}{C^{2}} + \frac{8}{C^{3 / 2}} + \frac{1}{C}) + \frac{16}{C} \max {Var (X_{i}^{2}), 2 E {[X_{i}^{2}]}^{2}},

that is, provided C > 0 is sufficiently large, we are done.
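As a sanity check on this final inequality, the following sketch evaluates $C' (C)$ numerically and searches for the smallest integer C for which the displayed condition holds. The standard Gaussian moments $E [X_{i}^{2}] = 1$ and $Var (X_{i}^{2}) = 2$ are assumptions chosen for illustration, and the function names are hypothetical. The sketch checks only this last inequality, not the other places in the proof where m is required to be large.

```python
def c_prime(C, mu2, var_x2):
    """C'(C) = E[X^2]^2 (16/C^2 + 8/C^{3/2} + 1/C) + (16/C) max{Var(X^2), 2 E[X^2]^2}."""
    big = max(var_x2, 2.0 * mu2 ** 2)
    return mu2 ** 2 * (16.0 / C ** 2 + 8.0 / C ** 1.5 + 1.0 / C) + 16.0 * big / C


def barrier_constant(mu2, var_x2):
    """Left-hand side of the condition: (1/16) min{Var(X^2), 2 E[X^2]^2}."""
    return min(var_x2, 2.0 * mu2 ** 2) / 16.0


def smallest_integer_C(mu2, var_x2):
    # C'(C) tends to zero as C grows and the barrier constant is positive,
    # so this linear scan terminates.
    C = 1
    while c_prime(C, mu2, var_x2) >= barrier_constant(mu2, var_x2):
        C += 1
    return C


# Assumed moments of a standard Gaussian coordinate: E[X^2] = 1, Var(X^2) = E[X^4] - 1 = 2.
MU2, VAR_X2 = 1.0, 2.0
C_min = smallest_integer_C(MU2, VAR_X2)
```

Under these assumed moments the threshold is a moderate constant of order a few hundred, which gives a feel for the size of C implicit in the overparameterization $m > C d^{2}$ .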

4.8.2. Part (b).

Note that the result of Bai and Yin [3] asserts that, if ${μ_{1}, \dots, μ_{d}}$ are the eigenvalues of

A ≜ \frac{1}{2 \sqrt{m d}} ({(W^{*})}^{T} W^{*} - m I_{d}),

and if we define the empirical measure

F^{A} (x) = \frac{1}{d} | {i : μ_{i} \leq x} |,

then in the regime $d \to + \infty, d / m \to 0$ , it holds that

F^{A} (x) \to ω (x),

almost surely, where $ω (x)$ is the semicircle law, and moreover,

\frac{1}{d} \sum_{i = 1}^{d} μ_{i}^{2} \to \int x^{2} d ω (x) ≜ χ_{2},

w.h.p.; namely, $χ_{2}$ is the second moment of the semicircle law. Now, define the same quantities as in the proof of part (a), where this time $W_{0}^{T} W_{0} = m I_{d}$ and ${λ_{1}, \dots, λ_{d}} = σ ({(W^{*})}^{T} W^{*} - m I_{d})$ . In particular, we still retain the inequality per Theorem 12(c):

L (W_{0}) \leq μ_{2}^{2} {(\sum_{i = 1}^{d} λ_{i})}^{2} + \max {Var (X_{i}^{2}), 2 E {[X_{i}^{2}]}^{2}} (\sum_{i = 1}^{d} λ_{i}^{2}) .

Note that $λ_{i} = 2 \sqrt{m d} μ_{i}$ . Hence, we obtain

\sum_{i = 1}^{d} λ_{i}^{2} < (4 + o (1)) m d^{2} χ_{2}

w.h.p. We now control $\sum_{i = 1}^{d} λ_{i}$ using the central limit theorem (CLT). Observe that

\sum_{i = 1}^{d} λ_{i} = trace ({(W^{*})}^{T} W^{*} - m I_{d}) = \sum_{i = 1}^{m} \sum_{j = 1}^{d} ((W_{i j}^{*})^{2} - 1) .

Now, note that

σ_{*}^{2} ≜ Var ({(W_{i j}^{*})}^{2} - 1) = Var ({(W_{i j}^{*})}^{2}) < E [{(W_{i j}^{*})}^{4}] < \infty .

We now use the CLT as $d \to \infty$ and $m / d \to \infty$ . To that end, let $1 / 2 > ϵ > 0$ be fixed. Observe now that, for arbitrary M > 0 and sufficiently large d,

{- 1 \leq \frac{1}{σ_{*} \sqrt{m d} d^{ϵ}} \sum_{i = 1}^{m} \sum_{j = 1}^{d} ((W_{i j}^{*})^{2} - 1) \leq 1} \supset {- M \leq \frac{1}{σ_{*} \sqrt{m d}} \sum_{i = 1}^{m} \sum_{j = 1}^{d} ((W_{i j}^{*})^{2} - 1) \leq M} .

In particular, using the central limit theorem, we deduce

\underset{d \to \infty}{lim inf} P (- 1 \leq \frac{1}{σ_{*} \sqrt{m d} d^{ϵ}} \sum_{i = 1}^{m} \sum_{j = 1}^{d} ((W_{i j}^{*})^{2} - 1) \leq 1) \geq P (Z \in [- M, M]),

where Z is a standard normal random variable. Now, because M > 0 is arbitrary, by sending $M \to + \infty$ , we obtain

\underset{d \to \infty}{lim inf} P (- 1 \leq \frac{1}{σ_{*} \sqrt{m d} d^{ϵ}} \sum_{i = 1}^{m} \sum_{j = 1}^{d} ((W_{i j}^{*})^{2} - 1) \leq 1) \geq 1,

and we then conclude

\lim_{d \to \infty} P (- 1 \leq \frac{1}{σ_{*} \sqrt{m d} d^{ϵ}} \sum_{i = 1}^{m} \sum_{j = 1}^{d} ((W_{i j}^{*})^{2} - 1) \leq 1) = 1 .

Hence,

| \sum_{i = 1}^{d} λ_{i} | \leq σ_{*} \sqrt{m d} d^{ϵ},

with probability $1 - o_{d} (1)$ for d sufficiently large.

Moreover,

σ_{min} {(W^{*})}^{4} \geq \frac{1}{16} m^{2},

for m large, using yet another result of Bai and Yin; see Theorem 16. From here, carrying out the exact same analysis as in part (a), we obtain that, provided $m > C d^{2}$ for some large constant C > 0 and d sufficiently large, the following holds with probability $1 - o_{d} (1)$ :

L (W_{0}) < \min_{W \in R^{m \times d} : rank (W) < d} L (W),

where $W_{0}$ is prescribed such that $W_{0}^{T} W_{0} = m I_{d}$ .

4.9. Proof of Theorem 2

First, let

S_{1} ≜ {W \in R^{m \times d} : rank (W) < d, \hat{L} (W) < \frac{1}{2} C_{5} σ_{min} {(W^{*})}^{4}} .

We start with the following claim.

Claim 1.

In the setting of Theorem 2, the following holds. With probability at least $1 - \exp (- C' d)$ (where $C' > 0$ is some absolute constant), it holds that, for any $W \in R^{m \times d}$ with $\hat{L} (W) \leq \frac{1}{2} C_{5} σ_{min} {(W^{*})}^{4}$ ,

{‖ W ‖}_{F} \leq d^{K + 1},

provided $N \geq C^{‴} d$ for some absolute constant $C^{‴} > 0$ .

Proof of Claim 1.

For convenience, let ${\hat{L}}_{0} ≜ \frac{1}{2} C_{5} σ_{min} {(W^{*})}^{4}$ , and for the random data vector $X = (X_{1}, \dots, X_{d}) \in R^{d}$ , let $σ^{2} = E [X_{1}^{2}]$ . Recall that X has i.i.d. centered coordinates with a sub-Gaussian coordinate distribution.

We have the following, in which the implication is due to Cauchy–Schwarz:

{\hat{L}}_{0} \geq \frac{1}{N} \sum_{1 \leq i \leq N} (Y_{i} - f (X_{i}; W))^{2} \Rightarrow {({\hat{L}}_{0})}^{1 / 2} \geq | \frac{1}{N} \sum_{1 \leq i \leq N} (Y_{i} - f (X_{i}; W)) | .

We now establish that, with probability at least $1 - 2 \exp (- t^{2} d)$ , the following holds provided $N \geq C {(t / ϵ)}^{2} d$ : for every $W \in R^{m \times d}$ ,

| \frac{1}{N} \sum_{1 \leq i \leq N} X_{i}^{T} W^{T} W X_{i} - σ^{2} {‖ W ‖}_{F}^{2} | \leq ϵ σ^{2} {‖ W ‖}_{F}^{2} .

To see this, we begin by noticing $X_{i}^{T} W^{T} W X_{i} = trace (X_{i}^{T} W^{T} W X_{i}) = 〈 W^{T} W, X_{i} X_{i}^{T} 〉$ . Using this, we have

| \frac{1}{N} \sum_{1 \leq i \leq N} X_{i}^{T} W^{T} W X_{i} - σ^{2} {‖ W ‖}_{F}^{2} | = | 〈 W^{T} W, \frac{1}{N} \sum_{1 \leq i \leq N} X_{i} X_{i}^{T} - σ^{2} I_{d} 〉 | .

We now use Hölder’s inequality (Theorem 18) with $p = 1, q = \infty, U = W^{T} W$ and $V = \frac{1}{N} \sum_{i} X_{i} X_{i}^{T} - σ^{2} I_{d}$ . This yields

| 〈 W^{T} W, \frac{1}{N} \sum_{1 \leq i \leq N} X_{i} X_{i}^{T} - σ^{2} I_{d} 〉 | \leq {‖ W ‖}_{F}^{2} ‖ \frac{1}{N} \sum_{1 \leq i \leq N} X_{i} X_{i}^{T} - σ^{2} I_{d} ‖ .

Observing now $E [X_{i} X_{i}^{T}] = σ^{2} I_{d}$ , we have

‖ \frac{1}{N} \sum_{1 \leq i \leq N} X_{i} X_{i}^{T} - σ^{2} I_{d} ‖ \leq ϵ σ^{2}

with probability at least $1 - 2 \exp (- t^{2} d)$ provided $N \geq C {(t / ϵ)}^{2} d$ , using the concentration result on the sample covariance matrix from Vershynin [91, corollary 5.50]. Hence, on this high probability event, the following holds:

\frac{1}{N} \sum_{1 \leq i \leq N} X_{i}^{T} {(W^{*})}^{T} W^{*} X_{i} \leq σ^{2} (1 + ϵ) {‖ W^{*} ‖}_{F}^{2} and \frac{1}{N} \sum_{1 \leq i \leq N} X_{i}^{T} W^{T} W X_{i} \geq σ^{2} (1 - ϵ) {‖ W ‖}_{F}^{2} .

Hence,

{({\hat{L}}_{0})}^{1 / 2} \geq \frac{1}{N} \sum_{1 \leq i \leq N} (X_{i}^{T} W^{T} W X_{i} - X_{i}^{T} (W^{*})^{T} W^{*} X_{i}) \geq σ^{2} (1 - ϵ) {‖ W ‖}_{F}^{2} - σ^{2} (1 + ϵ) {‖ W^{*} ‖}_{F}^{2} .

This yields, for any W with $\hat{L} (W) \leq {\hat{L}}_{0}$ ,

{‖ W ‖}_{F} \leq {(\frac{{({\hat{L}}_{0})}^{1 / 2}}{σ^{2} (1 - ϵ)} + \frac{1 + ϵ}{1 - ϵ} {‖ W^{*} ‖}_{F}^{2})}^{1 / 2}

with probability at least $1 - 2 \exp (- t^{2} d)$ . Now, observe that

σ_{min} {(W^{*})}^{2} = λ_{min} ({(W^{*})}^{T} W^{*}) \leq trace ({(W^{*})}^{T} W^{*}) = {‖ W^{*} ‖}_{F}^{2} \leq d^{2 K} .

Furthermore, $C_{5} = O (1)$ . This yields

{\hat{L}}_{0} = \frac{1}{2} C_{5} σ_{min} {(W^{*})}^{4} = O (d^{4 K}) .

(12)

We now take $ϵ = 1 / 2$ and conclude that

{‖ W ‖}_{F} \leq {(\frac{{({\hat{L}}_{0})}^{1 / 2}}{σ^{2} (1 - ϵ)} + \frac{1 + ϵ}{1 - ϵ} {‖ W^{*} ‖}_{F}^{2})}^{1 / 2} \leq d^{K + 1}

for d large enough with probability at least $1 - 2 \exp (- t^{2} d)$ , which is $1 - \exp (- C' d)$ for some absolute constant $C' > 0$ . □

Having established Claim 1, we now return to the proof of Theorem 2. Let

S_{2} ≜ {W \in R^{m \times d} : rank (W) < d, \hat{L} (W) < \frac{1}{2} C_{5} σ_{min} {(W^{*})}^{4}, {‖ W ‖}_{F} \leq d^{K + 1}} .

A consequence of Claim 1 is that $P (S_{1} = S_{2}) \geq 1 - \exp (- C' d)$ . We now establish the following.

Claim 2.

P (S_{2} = Ø) \geq 1 - {(9 d^{4 K + 9})}^{d^{2} - 1} \cdot \exp (- C_{3} N d^{- 4 - 4 K}) - N d e^{- C d} .

Note that combining Claims 1 and 2 through a union bound yields

\inf_{W \in R^{m \times d} : rank (W) < d} \hat{L} (W) \geq \frac{1}{2} C_{5} σ_{min} {(W^{*})}^{4},

with probability at least

1 - \exp (- C' d) - {(9 d^{9 + 4 K})}^{d^{2} - 1} \exp (- C_{3} N d^{- 4 - 4 K}) - N d e^{- C d},

therefore establishing Theorem 2.

Proof of Claim 2.

Let $A = W^{T} W \in R^{d \times d}$ . We claim ${‖ A ‖}_{F} \leq d^{2 K + 2}$ . To see this, note that ${‖ A ‖}_{F}^{2} = trace (A^{T} A) = trace (A^{2})$ . Let $θ_{1}, \dots, θ_{d}$ be the eigenvalues of A, all nonnegative as $A ⪰ 0$ , and $θ_{1}^{2}, \dots, θ_{d}^{2}$ are the eigenvalues of A². With this,

trace (A^{2}) = \sum_{1 \leq i \leq d} θ_{i}^{2} \leq {(\sum_{1 \leq i \leq d} θ_{i})}^{2} = trace {(A)}^{2} .

Hence, ${‖ A ‖}_{F} \leq trace (A) = {‖ W ‖}_{F}^{2} \leq d^{2 K + 2}$ as claimed.
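The inequality between the two traces is the standard observation that the cross terms are nonnegative. Spelled out with the eigenvalues $θ_{i} \geq 0$ of A, it reads

{(\sum_{1 \leq i \leq d} θ_{i})}^{2} = \sum_{1 \leq i \leq d} θ_{i}^{2} + 2 \sum_{1 \leq i < j \leq d} θ_{i} θ_{j} \geq \sum_{1 \leq i \leq d} θ_{i}^{2} .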

Next, let

S_{R} = {A \in R^{d \times d} : rank (A) \leq d - 1, A ⪰ 0, {‖ A ‖}_{F} \leq R};

let ${\bar{S}}_{ϵ}$ be an ϵ-net for $S_{d^{2 K + 2}}$ in the Frobenius norm, where ϵ is to be tuned appropriately later. Using Lemma 3, we have

| {\bar{S}}_{ϵ} | \leq {(\frac{9 d^{2 K + 2}}{ϵ})}^{d^{2} - 1} .

Now, applying Lemma 2 with $K_{1} = \frac{1}{2}$ and $K_{2} = K$ and taking a union bound across the net ${\bar{S}}_{ϵ}$ , we arrive at the following conclusion:

\begin{array}{l} P (\underset{A \in {\bar{S}}_{ϵ}}{\cup} \underbrace{{\frac{1}{N} \sum_{1 \leq i \leq N} (Y_{i} - X_{i}^{T} A X_{i})^{2} < \frac{1}{2} C_{5} σ_{min} {(W^{*})}^{4}}}_{E {(A)}^{c} \text{ from Lemma 2}} | {‖ X_{i} ‖}_{\infty} \leq \sqrt{d}, 1 \leq i \leq N) \\ \leq {(\frac{9 d^{2 K + 2}}{ϵ})}^{d^{2} - 1} \exp (- C_{3} N d^{- 4 - 4 K}), \end{array}

where

C_{5} = \min {μ_{4} (1 / 2) - μ_{2} {(1 / 2)}^{2}, 2 μ_{2} {(1 / 2)}^{2}} and μ_{n} (K) = E [X_{i}^{n} | | X_{i} | \leq d^{K}] .

Now, because $P ({‖ X_{i} ‖}_{\infty} < \sqrt{d}, 1 \leq i \leq N) \geq 1 - N d \exp (- C d)$ by Lemma 1, we obtain

\begin{array}{l} P (\underset{A \in {\bar{S}}_{ϵ}}{\cap} {\frac{1}{N} \sum_{1 \leq i \leq N} (Y_{i} - X_{i}^{T} A X_{i})^{2} \geq \frac{1}{2} C_{5} σ_{min} {(W^{*})}^{4}}) \\ \geq 1 - {(\frac{9 d^{2 K + 2}}{ϵ})}^{d^{2} - 1} \cdot \exp (- C_{3} \frac{N}{d^{4 + 4 K}}) - N d \exp (- C d) . \end{array}

In the remainder of the proof, suppose for every $A \in {\bar{S}}_{ϵ}$ ,

\frac{1}{N} \sum_{1 \leq i \leq N} {(Y_{i} - X_{i}^{T} A X_{i})}^{2} \geq \frac{1}{2} C_{5} σ_{min} {(W^{*})}^{4},

and ${‖ X_{i} ‖}_{\infty} \leq d^{1 / 2}, i \in [N]$ , which collectively hold with probability at least

1 - {(\frac{9 d^{2 K + 2}}{ϵ})}^{d^{2} - 1} \cdot \exp (- C_{3} \frac{N}{d^{4 + 4 K}}) - 2 N d \exp (- C d) .

Now, let $W \in R^{m \times d}$ with ${‖ W ‖}_{F} \leq d^{K + 1}, rank (W) \leq d - 1$ . Let $A = W^{T} W$ (thus, ${‖ A ‖}_{F} \leq d^{2 K + 2}$ ) and $\hat{A} \in {\bar{S}}_{ϵ}$ be such that ${‖ A - \hat{A} ‖}_{F} \leq ϵ$ . We now estimate

Δ ≜ | \frac{1}{N} \sum_{1 \leq i \leq N} {(Y_{i} - X_{i}^{T} A X_{i})}^{2} - \frac{1}{N} \sum_{1 \leq i \leq N} {(Y_{i} - X_{i}^{T} \hat{A} X_{i})}^{2} | .

For notational convenience, let $A^{*} = {(W^{*})}^{T} W^{*}$ . Now,

\begin{array}{l} Δ & \leq \frac{1}{N} \sum_{1 \leq i \leq N} | {(X_{i}^{T} (A - A^{*}) X_{i})}^{2} - {(X_{i}^{T} (\hat{A} - A^{*}) X_{i})}^{2} | \\ = \frac{1}{N} \sum_{1 \leq i \leq N} | X_{i}^{T} (A - \hat{A}) X_{i} | \cdot | X_{i}^{T} (A + \hat{A} - 2 A^{*}) X_{i} | . \end{array}

Now, using Cauchy–Schwarz (for inner product $〈 M, N 〉 ≜ trace (M^{T} N)$ ),

| X_{i}^{T} (A - \hat{A}) X_{i} | = | 〈 A - \hat{A}, X_{i} X_{i}^{T} 〉 | \leq {‖ A - \hat{A} ‖}_{F} \cdot {‖ X_{i} ‖}_{2}^{2},

using ${‖ X_{i} X_{i}^{T} ‖}_{F} = {‖ X_{i} ‖}_{2}^{2}$ . In particular, because ${‖ X_{i} ‖}_{\infty} \leq d^{1 / 2}$ implies ${‖ X_{i} ‖}_{2}^{2} \leq d^{2}$ , we obtain

| X_{i}^{T} (A - \hat{A}) X_{i} | \leq ϵ d^{2} .

For the term $| X_{i}^{T} (A + \hat{A} - 2 A^{*}) X_{i} |$ , we observe that the triangle inequality yields

{‖ A + \hat{A} - 2 A^{*} ‖}_{F} \leq 4 d^{2 K + 2} .

Thus,

| X_{i}^{T} (A + \hat{A} - 2 A^{*}) X_{i} | \leq 4 d^{2 K + 4} .

Using these, we obtain

| \hat{L} (W) - \frac{1}{N} \sum_{1 \leq i \leq N} {(Y_{i} - X_{i}^{T} \hat{A} X_{i})}^{2} | \leq 4 ϵ d^{6 + 2 K} = O (d^{- 1}) = o_{d} (1),

taking $ϵ = d^{- 7 - 2 K}$ . Using, finally, the fact that $\frac{1}{N} \sum_{1 \leq i \leq N} {(Y_{i} - X_{i}^{T} \hat{A} X_{i})}^{2}$ is bounded away from zero across the net ${\bar{S}}_{ϵ}$ , we conclude the proof of Claim 2. □
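For concreteness, the exponent bookkeeping behind this choice of ϵ can be spelled out as follows (a routine check that uses only the bounds displayed above). The approximation error satisfies

4 ϵ d^{6 + 2 K} = 4 d^{- 7 - 2 K} d^{6 + 2 K} = 4 d^{- 1},

and the cardinality bound from Lemma 3 becomes

{(\frac{9 d^{2 K + 2}}{ϵ})}^{d^{2} - 1} = {(9 d^{2 K + 2} \cdot d^{7 + 2 K})}^{d^{2} - 1} = {(9 d^{4 K + 9})}^{d^{2} - 1},

which is the factor appearing in the statement of Claim 2.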

Because Claims 1 and 2 together yield Theorem 2, as noted above, the proof of Theorem 2 is complete.

4.9.1. Case of Constant d: $d = O (1)$ .

We now carry out a separate analysis for the case of constant d ( $d = O (1)$ ). We only point out the necessary modifications, hiding factors depending on the constant d under asymptotic notations.

In what follows, we use the fact that, if X is a sub-Gaussian random variable, then $E [| X |^{p}] < \infty$ for every $p \geq 1$ . For a proof, see Vershynin [91, lemma 5.5], which establishes a stronger conclusion that $E {[| X |^{p}]}^{1 / p} = O (\sqrt{p})$ for every $p \geq 1$ .

4.9.2. Modifying Claim 1.

Claim 1 modifies to the following: with probability at least $1 - O (1 / N)$ , it holds that, for any $W \in R^{m \times d}$ with $\hat{L} (W) \leq \frac{1}{2} \bar{C_{5}} σ_{min} {(W^{*})}^{4}$ ,

{‖ W ‖}_{F} = O (1) .

We now sketch the proof of this modified claim. For a matrix $M \in R^{d \times d}$ , denote by ${‖ M ‖}_{\infty} ≔ \max_{1 \leq i, j \leq d} | M_{i j} |$ and let $ϵ > 0$ be arbitrary. Then, we show that, over the randomness in $X_{i} \in R^{d}, 1 \leq i \leq N$ with probability at least $1 - O (1 / N)$ ,

{‖ \frac{1}{N} \sum_{1 \leq i \leq N} X_{i} X_{i}^{T} - σ^{2} I_{d} ‖}_{\infty} \leq ϵ .

Indeed, fix an $ϵ > 0$ . Then, for any $1 \leq j \leq d$ ,

P (| \frac{1}{N} \sum_{1 \leq i \leq N} X_{i} {(j)}^{2} - σ^{2} | \leq ϵ) \geq 1 - O (1 / N) .

This is by Chebyshev’s inequality (here, $X_{i} = (X_{i} (j) : 1 \leq j \leq d) \in R^{d}$ ). Here, we used, in particular, the fact that $E [X_{i} {(j)}^{4}] < \infty$ . Likewise, for any $1 \leq j < j' \leq d$ ,

P (| \frac{1}{N} \sum_{1 \leq i \leq N} X_{i} (j) X_{i} (j') | \leq ϵ) \geq 1 - O (1 / N) .

Taking a union bound over these $d (d + 1) / 2$ events corresponding to the entries of the matrix $N^{- 1} \sum_{1 \leq i \leq N} X_{i} X_{i}^{T}$ , we are done. (Note that the number of events, $d (d + 1) / 2$ , is O(1). This and other factors depending on ϵ are hidden under $O (\cdot)$ .) Using the bound $‖ M ‖ \leq {‖ M ‖}_{F} \leq d {‖ M ‖}_{\infty}$ , valid for any matrix $M \in R^{d \times d}$ , we arrive at the operator norm bound

‖ \frac{1}{N} \sum_{1 \leq i \leq N} X_{i} X_{i}^{T} - σ^{2} I_{d} ‖ \leq ϵ d .

Finally, taking $ϵ = 1 / (2 d^{2}) = O (1)$ , we finish the proof of modified Claim 1.

4.9.3. Modifying Claim 2.

Claim 2 now modifies to $P (S_{2} = Ø) \geq 1 - O (1 / N)$ , and this modified version is shown as follows. Note that, for any $ϵ = O (1)$ , the size of the net we consider is O(1). We claim that, with probability $1 - O (1 / N)$ , it holds that, for any $A \in {\bar{S}}_{ϵ}$ ,

\frac{1}{N} \sum_{1 \leq i \leq N} (Y_{i} - X_{i}^{T} A X_{i})^{2} \geq \frac{2}{3} \bar{C_{5}} σ_{min} {(W^{*})}^{4},

where

\bar{C_{5}} = \min {μ_{4} - μ_{2}^{2}, 2 μ_{2}^{2}} and μ_{n} = E [X_{1} {(1)}^{n}] .

To show this, fix an $A \in {\bar{S}}_{ϵ}$ . Now, instead of Lemma 2, one can apply Chebyshev’s inequality:

\frac{1}{N} \sum_{1 \leq i \leq N} (X_{i}^{T} A X_{i} - X_{i}^{T} {(W^{*})}^{T} W^{*} X_{i})^{2} \geq \frac{2}{3} E [{(X^{T} A X - X^{T} {(W^{*})}^{T} W^{*} X)}^{2}],

with probability at least $1 - O (1 / N)$ . Here, we, in particular, used the fact $E [X_{i} {(j)}^{8}] < \infty$ . Because

E [{(X^{T} A X - X^{T} {(W^{*})}^{T} W^{*} X)}^{2}] \geq \bar{C_{5}} σ_{min} {(W^{*})}^{4}

by Theorem 1, we establish the claim by taking a union bound over the net ${\bar{S}}_{ϵ}$ , which has O(1) cardinality. The rest of the argument for Claim 2 remains (nearly) intact. In particular, the bound ${‖ A - \hat{A} ‖}_{F} \leq ϵ$ remains intact, and ${‖ A + \hat{A} - 2 A^{*} ‖}_{F}$ is now O(1). Finally, keeping in mind that

\frac{1}{N} \sum_{1 \leq i \leq N} {‖ X_{i} ‖}_{2}^{4} = O (1)

with probability $1 - O (1 / N)$ , we complete the proof by taking ϵ small enough.

Putting these together as in the proof of Theorem 2, we complete the proof.

4.10. Proof of Theorem 4

We start by computing $\nabla \hat{L} (W)$ . Taking derivatives with respect to the jth row W_j of $W \in R^{m \times d}$ , we arrive at

\nabla_{W_{j}} \hat{L} (W) = \frac{4}{N} \sum_{1 \leq i \leq N} (\sum_{1 \leq l \leq m} {〈 W_{l}, X_{i} 〉}^{2} - Y_{i}) 〈 W_{j}, X_{i} 〉 X_{i} .

Interpreting these gradients as a row vector and aggregating into a matrix, we then have

\nabla_{W} \hat{L} (W) = W (\frac{4}{N} \sum_{1 \leq i \leq N} (\sum_{1 \leq j \leq m} {〈 W_{j}, X_{i} 〉}^{2} - Y_{i}) X_{i} X_{i}^{T}) .
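The matrix form of the gradient can be checked numerically against central finite differences. The sketch below does this in plain Python on a tiny instance ( $m = d = 2$ , $N = 3$ ); the data, the helper names, and the step size h are all illustrative assumptions.

```python
def quad_net(W, x):
    """f(x; W) = sum_j <W_j, x>^2, with W given as a list of rows."""
    return sum(sum(w * xi for w, xi in zip(row, x)) ** 2 for row in W)


def empirical_risk(W, X, Y):
    return sum((y - quad_net(W, x)) ** 2 for x, y in zip(X, Y)) / len(X)


def analytic_grad(W, X, Y):
    """Matrix form: W @ S with S = (4/N) sum_i (f(X_i; W) - Y_i) X_i X_i^T."""
    n, d = len(X), len(W[0])
    S = [[0.0] * d for _ in range(d)]
    for x, y in zip(X, Y):
        c = 4.0 * (quad_net(W, x) - y) / n
        for a in range(d):
            for b in range(d):
                S[a][b] += c * x[a] * x[b]
    return [[sum(row[a] * S[a][b] for a in range(d)) for b in range(d)] for row in W]


def numeric_grad(W, X, Y, h=1e-6):
    """Entrywise central finite differences of the empirical risk."""
    G = [[0.0] * len(W[0]) for _ in W]
    for j in range(len(W)):
        for a in range(len(W[0])):
            W_plus = [row[:] for row in W]
            W_minus = [row[:] for row in W]
            W_plus[j][a] += h
            W_minus[j][a] -= h
            G[j][a] = (empirical_risk(W_plus, X, Y) - empirical_risk(W_minus, X, Y)) / (2 * h)
    return G


# Illustrative data: labels are generated by a planted matrix W_star.
W_star = [[1.0, 0.5], [-0.3, 2.0]]
W = [[0.7, -0.2], [0.4, 1.1]]
X = [[1.0, 2.0], [-0.5, 0.3], [1.5, -1.0]]
Y = [quad_net(W_star, x) for x in X]

max_gap = max(
    abs(a - b)
    for row_a, row_b in zip(analytic_grad(W, X, Y), numeric_grad(W, X, Y))
    for a, b in zip(row_a, row_b)
)
```

Here `max_gap` measures the largest entrywise discrepancy between the closed-form gradient and its finite-difference approximation; at the planted matrix `W_star` the residuals vanish, so the closed-form gradient is identically zero there.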

Assume now that $rank (W) = d$ and $\nabla \hat{L} (W) = 0$ . Because W has full column rank, $W S = 0$ forces $S = 0$ for any $S \in R^{d \times d}$ ; we then arrive at

\frac{1}{N} \sum_{1 \leq i \leq N} (\sum_{1 \leq j \leq m} {〈 W_{j}, X_{i} 〉}^{2} - Y_{i}) X_{i} X_{i}^{T} = 0 .

We now claim that $\hat{L} (W) = 0$ . To see this, we take a route similar to Soltanolkotabi et al. [83, lemma 6.1]. Let $M ≜ W^{T} W$ , and consider the function

f (M) ≜ \frac{1}{N} \sum_{1 \leq i \leq N} {(Y_{i} - X_{i}^{T} M X_{i})}^{2} .

Observe that $f (\cdot)$ is a convex quadratic in M. Thus, for any $\hat{M}$ with $\nabla f (\hat{M}) = 0$ , that is,

\frac{1}{N} \sum_{1 \leq i \leq N} (X_{i}^{T} \hat{M} X_{i} - Y_{i}) X_{i} X_{i}^{T} = 0,

it is the case that $\hat{M}$ is a global optimum of f. In particular, for any $M \in R^{d \times d}$ , $f (M) \geq f (\hat{M})$ . Now, take any $\bar{W} \in R^{m \times d}$ , and observe that $\hat{L} (\bar{W}) = f ({\bar{W}}^{T} \bar{W})$ . Because $\nabla f (W^{T} W) = 0$ , it follows that

\hat{L} (\bar{W}) = f ({\bar{W}}^{T} \bar{W}) \geq f (W^{T} W) = \hat{L} (W) .

Namely, W is indeed a global optimizer of $\hat{L} (\cdot)$ . Because $W = W^{*}$ makes the cost zero, we obtain $\hat{L} (W) = 0$ .

Now, using Theorem 10, we obtain that $span (X_{i} X_{i}^{T} : 1 \leq i \leq N)$ is the set of all d × d symmetric matrices with probability one provided $N \geq d (d + 1) / 2$ . In this case, using Theorem 9, we conclude that $W^{T} W = {(W^{*})}^{T} W^{*}$ , concluding the proof.

4.11. Proof of Theorem 6

4.11.1. Part (a).

Note that, by Claim 1, it follows that, with probability at least $1 - \exp (- C' d)$ , it is the case that, for any W with $\hat{L} (W) \leq \hat{L} (W_{0}) < \frac{1}{2} C_{5} σ_{min} {(W^{*})}^{4}, {‖ W ‖}_{F} \leq d^{K + 1}$ . Now, let

E_{1} ≜ {\sup_{W : \hat{L} (W) \leq {\hat{L}}_{0}} {‖ W ‖}_{F} \leq d^{K + 1}};

(13)

thus, $P (E_{1}) \geq 1 - \exp (- C' d)$ , and

E_{2} ≜ {{‖ X_{i} ‖}_{\infty} \leq d^{1 / 2}, 1 \leq i \leq N},

(14)

such that $P (E_{2}) \geq 1 - N d \exp (- C d)$ per Lemma 1.

Note that $‖ \nabla^{2} \hat{L} (W) ‖$ is bounded by a polynomial in ${‖ W ‖}_{F}, ‖ X_{1} ‖, \dots, ‖ X_{N} ‖$ . Thus, on the event $E_{1} \cap E_{2}$ , which holds with probability at least $1 - N d \exp (- C d) - \exp (- C' d)$ , we have that

L = \sup {‖ \nabla^{2} \hat{L} (W) ‖ : \hat{L} (W) \leq {\hat{L}}_{0}} = poly (d) < + \infty

as claimed.

4.11.2. Part (b).

Suppose that the event $E_{1} \cap E_{2}$ (where $E_{1}$ and $E_{2}$ are defined, respectively, in (13) and (14)) takes place. We run the gradient descent with a step size of $η < 1 / (2 L)$ . A second-order Taylor expansion reveals that

\hat{L} (W_{1}) - \hat{L} (W_{0}) \leq - η {‖ \nabla \hat{L} (W_{0}) ‖}_{F}^{2} / 2,

where $\nabla \hat{L} (W)$ is the gradient of the empirical risk evaluated at W. In particular, $\hat{L} (W_{1}) \leq \hat{L} (W_{0})$ . Because $E_{1}$ takes place, we conclude $‖ \nabla^{2} \hat{L} (W_{1}) ‖ \leq L = poly (d)$ , where $‖ \nabla^{2} \hat{L} (W) ‖$ is the spectral norm of the Hessian matrix $\nabla^{2} \hat{L} (W)$ . From here, we induct on k; the induction argument reveals that we can retain a step size of $η < 1 / (2 L)$ (thus, $η = 1 / poly (d)$ ), and furthermore, along the trajectory ${W_{k}}_{k \geq 0}$ , it holds that

\hat{L} (W_{k + 1}) - \hat{L} (W_{k}) \leq - η {‖ \nabla \hat{L} (W_{k}) ‖}_{F}^{2} / 2 .

Now, let T be the first time for which ${‖ \nabla \hat{L} (W_{T}) ‖}_{F} \leq ζ$ , namely, the horizon required to arrive at a ζ-stationary point. In what follows, we carry out our analysis in terms of ζ. At the end, we incorporate the bound (4) on ζ.

We claim $T = poly (ζ^{- 1}, d, σ_{min} {(W^{*})}^{- 1})$ .

To see this, note that, from the definition of T, it holds that ${‖ \nabla \hat{L} (W_{t}) ‖}_{F} > ζ$ for every $t \leq T - 1$ . Now, a telescoping argument together with $η = 1 / poly (d)$ reveals

\hat{L} (W_{T}) - \hat{L} (W_{0}) \leq - T {(poly (d))}^{- 1} ζ^{2} .

Using now $\hat{L} (W_{T}) \geq 0$ , we conclude $\hat{L} (W_{0}) \geq T ζ^{2} {(poly (d))}^{- 1}$ . Because $\hat{L} (W_{0}) < {\hat{L}}_{0}$ and ${\hat{L}}_{0}$ is at most polynomial in d as per (12), we conclude $T = poly (ζ^{- 1}, d)$ .
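The telescoping step can be made explicit as follows (a sketch; the factor 1/2 is the one from the descent inequality above). Summing that inequality over $0 \leq t \leq T - 1$ and using that the gradient norm is at least ζ before time T gives

\hat{L} (W_{T}) - \hat{L} (W_{0}) = \sum_{t = 0}^{T - 1} (\hat{L} (W_{t + 1}) - \hat{L} (W_{t})) \leq - \frac{η}{2} \sum_{t = 0}^{T - 1} {‖ \nabla \hat{L} (W_{t}) ‖}_{F}^{2} \leq - \frac{η T ζ^{2}}{2},

so that, because $\hat{L} (W_{T}) \geq 0$ ,

T \leq \frac{2 \hat{L} (W_{0})}{η ζ^{2}} .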

We now turn our attention to bounding the empirical risk of the resulting ζ-stationary point. Let $r_{i} ≜ Y_{i} - X_{i}^{T} W^{T} W X_{i}$ . Note that $\hat{L} (W) = \frac{1}{N} \sum_{1 \leq i \leq N} r_{i}^{2}$ . Now,

\begin{array}{rl} \hat{L} (W) & = \frac{1}{N} \sum_{1 \leq i \leq N} r_{i} (X_{i}^{T} {(W^{*})}^{T} W^{*} X_{i} - X_{i}^{T} W^{T} W X_{i}) \\ & = 〈 {(W^{*})}^{T} W^{*} - W^{T} W, \frac{1}{N} \sum_{1 \leq i \leq N} r_{i} X_{i} X_{i}^{T} 〉 . \end{array}

Using the Cauchy–Schwarz inequality, we have

\begin{array}{l} \hat{L} (W) & = | 〈 W^{T} W - {(W^{*})}^{T} W^{*}, \frac{1}{N} \sum_{1 \leq i \leq N} r_{i} X_{i} X_{i}^{T} 〉 | \\ \leq {‖ W^{T} W - {(W^{*})}^{T} W^{*} ‖}_{F} \cdot {‖ \frac{1}{N} \sum_{1 \leq i \leq N} r_{i} X_{i} X_{i}^{T} ‖}_{F} . \end{array}

Next, ${‖ W^{T} W ‖}_{F}^{2} = trace ({(W^{T} W)}^{2}) \leq {(trace (W^{T} W))}^{2} = {‖ W ‖}_{F}^{4}$ , using the fact that $W^{T} W ⪰ 0$ . In particular, on the event $E_{1}$ defined as per (13), we conclude that ${‖ W ‖}_{F} \leq d^{K + 1}$ , and therefore, ${‖ W^{T} W ‖}_{F} \leq d^{2 K + 2}$ . This, together with ${‖ W^{*} ‖}_{F} \leq d^{K}$ and the triangle inequality, then yields

{‖ W^{T} W - {(W^{*})}^{T} W^{*} ‖}_{F} \leq 2 d^{2 K + 2},

with probability at least $1 - \exp (- C' d)$ . Hence, on this event,

\hat{L} (W) \leq 2 d^{2 K + 2} {‖ \frac{1}{N} \sum_{1 \leq i \leq N} r_{i} X_{i} X_{i}^{T} ‖}_{F} .

(15)

With this, we now turn our attention to bounding

{‖ \frac{1}{N} \sum_{1 \leq i \leq N} r_{i} X_{i} X_{i}^{T} ‖}_{F} .

We establish that, for the event

E_{3} ≜ {\inf_{\begin{matrix} W \in R^{m \times d} : σ_{min} (W) < \frac{1}{2} σ_{min} (W^{*}) \\ {‖ W ‖}_{F} \leq d^{K + 1} \end{matrix}} \hat{L} (W) \geq \frac{1}{2} C_{5} σ_{min} {(W^{*})}^{4}},

(16)

it is the case that

P (E_{3}) \geq 1 - {(9 d^{4 K + 9})}^{d^{2} - 1} \cdot \exp (- C_{4} N d^{- 4 - 4 K}) - N d \exp (- C d) .

(17)

This is almost a straightforward modification of the proof of the earlier energy barrier result Theorem 2, and we only point out required modifications. Take any $W \in R^{m \times d}$ with $σ_{min} (W) < \frac{1}{2} σ_{min} (W^{*})$ . In particular,

λ_{min} (W^{T} W) = σ_{min} {(W)}^{2} < \frac{1}{4} σ_{min} {(W^{*})}^{2} .

Inspecting now the proof of Theorem 1(a), we obtain that, for such a W,

E [{(X^{T} W^{T} W X - X^{T} {(W^{*})}^{T} W^{*} X)}^{2} | {‖ X ‖}_{\infty} \leq d^{1 / 2}] \geq \frac{3}{4} C_{5} σ_{min} {(W^{*})}^{4},

and consequently, modifying Lemma 2, we have that

\begin{array}{l} P (\frac{1}{N} \sum_{1 \leq i \leq N} {(Y_{i} - X_{i}^{T} W^{T} W X_{i})}^{2} \geq \frac{1}{2} C_{5} σ_{min} {(W^{*})}^{4} | {‖ X_{i} ‖}_{\infty} \leq \sqrt{d}, 1 \leq i \leq N) \\ \geq 1 - \exp (- C' N d^{- 4 - 4 K}) . \end{array}

Using now a covering numbers bound, in the same manner as in the proof of Theorem 2, we conclude that

\inf_{\begin{matrix} W \in R^{m \times d} : σ_{min} (W) < \frac{1}{2} σ_{min} (W^{*}) \\ {‖ W ‖}_{F} \leq d^{K + 1} \end{matrix}} \hat{L} (W) \geq \frac{1}{2} C_{5} σ_{min} {(W^{*})}^{4}

with probability at least

1 - {(9 d^{4 K + 9})}^{d^{2} - 1} \cdot \exp (- C_{4} N d^{- 4 - 4 K}) - N d \exp (- C d) .

Now, suppose in the remainder of this part that the event $E_{1} \cap E_{2} \cap E_{3}$ , which is

\begin{array}{l} {\sup_{W : \hat{L} (W) \leq {\hat{L}}_{0}} {‖ W ‖}_{F} \leq d^{K + 1}} \cap {{‖ X_{i} ‖}_{\infty} \leq d^{1 / 2}, 1 \leq i \leq N} \\ \cap {\inf_{\begin{matrix} W \in R^{m \times d} : σ_{min} (W) < \frac{1}{2} σ_{min} (W^{*}) \\ {‖ W ‖}_{F} \leq d^{K + 1} \end{matrix}} \hat{L} (W) \geq \frac{1}{2} C_{5} σ_{min} {(W^{*})}^{4}}, \end{array}

holds true. In particular, for any W with risk less than $\frac{1}{2} C_{5} σ_{min} {(W^{*})}^{4}$ , we have $σ_{min} (W) > \frac{1}{2} σ_{min} (W^{*}) > 0$ (in particular, $W^{T} W$ is invertible for any such W). Now, take any ζ-stationary point W generated by the gradient descent. Because of the event $E_{3}$ and the fact that $\hat{L} (W) < {\hat{L}}_{0}$ , proven earlier, it holds that $rank (W) = d$ , and from the definition of ζ-stationarity, we have

{‖ \nabla \hat{L} (W) ‖}_{F} \leq ζ .

Inspecting the proof of Theorem 4, we observe that

\nabla \hat{L} (W) = - 4 W (\frac{1}{N} \sum_{1 \leq i \leq N} r_{i} X_{i} X_{i}^{T}) .

Thus, we arrive at

{‖ W (\frac{1}{N} \sum_{1 \leq i \leq N} r_{i} X_{i} X_{i}^{T}) ‖}_{F} \leq 4 ζ .

Let

B ≜ W (\frac{1}{N} \sum_{1 \leq i \leq N} r_{i} X_{i} X_{i}^{T}) .

Note now that

\frac{1}{N} \sum_{1 \leq i \leq N} r_{i} X_{i} X_{i}^{T} = {(W^{T} W)}^{- 1} W^{T} B .

Next, we have

{‖ {(W^{T} W)}^{- 1} ‖}_{2} = \frac{1}{σ_{min} (W^{T} W)} = \frac{1}{σ_{min} {(W)}^{2}} < \frac{4}{σ_{min} {(W^{*})}^{2}},

because of conditioning on $E_{3}$ (16). Furthermore,

{‖ W^{T} ‖}_{2} = {‖ W ‖}_{2} = \sqrt{λ_{max} (W^{T} W)} \leq \sqrt{trace (W^{T} W)} = {‖ W ‖}_{F} \leq d^{K + 1} .

We now combine these findings.

\begin{array}{l} {‖ \frac{1}{N} \sum_{1 \leq i \leq N} r_{i} X_{i} X_{i}^{T} ‖}_{F} = {‖ {(W^{T} W)}^{- 1} W^{T} B ‖}_{F} \\ \leq {‖ {(W^{T} W)}^{- 1} ‖}_{2} {‖ W^{T} B ‖}_{F} \\ \leq {‖ {(W^{T} W)}^{- 1} ‖}_{2} {‖ W^{T} ‖}_{2} {‖ B ‖}_{F} \\ \leq 16 ζ σ_{min} {(W^{*})}^{- 2} d^{K + 1} . \end{array}

We now use the bounds on $P (E_{1})$ as per (13), on $P (E_{2})$ as per (14), and on $P (E_{3})$ as per (17) to control $P (E_{1} \cap E_{2} \cap E_{3})$ . We conclude by the union bound that, with probability at least

1 - \exp (- C' d) - {(9 d^{4 K + 9})}^{d^{2} - 1} \cdot \exp (- C_{4} N d^{- 4 - 4 K}) - N d \exp (- C d),

it holds that, for any W with ${‖ \nabla \hat{L} (W) ‖}_{F} \leq ζ$, its empirical risk is controlled as per (15):

\hat{L} (W) \leq 32 ζ σ_{min} {(W^{*})}^{- 2} d^{4 K + 4} .

(18)

Finally, because $ζ \leq \frac{ϵ}{32} σ_{min} {(W^{*})}^{2} d^{- 4 K - 4}$ per (4), we deduce $\hat{L} (W) \leq ϵ$, as claimed. The running time is polynomial in $ζ^{- 1}$ and d and, therefore, is polynomial in $ϵ^{- 1}$, $σ_{min} {(W^{*})}^{- 1}$, and d. This completes the proof of Part (b).

4.11.3. Part (c).

Let $W \in R^{m \times d}$ be such that $\hat{L} (W) \leq κ$ . Define the matrix

M ≜ W^{T} W - {(W^{*})}^{T} W^{*} .

We bound ${‖ M ‖}_{F}$, which ensures that the learned matrix $W^{T} W$ is uniformly close to the ground truth ${(W^{*})}^{T} W^{*}$. We start by conditioning: assume in the remainder that the event $E_{2}$ in (14), stating ${‖ X_{i} ‖}_{\infty} \leq d^{1 / 2}$ for every $i \in [N]$, is true; this holds with probability at least $1 - N d \exp (- C d)$ as per Lemma 1.

Note that

\hat{L} (W) = \frac{1}{N} \sum_{1 \leq i \leq N} {(X_{i}^{T} M X_{i})}^{2} .

To this end, consider a matrix $Ξ \in R^{N \times d (d + 1) / 2}$, consisting of i.i.d. rows in which the $i$th row of Ξ is $R_{i} ≜ (X_{i} {(1)}^{2}, \dots, X_{i} {(d)}^{2}, X_{i} (k) X_{i} (ℓ) : 1 \leq k < ℓ \leq d) \in R^{d (d + 1) / 2}$. Next, let

Σ = E [R_{i} R_{i}^{T}] \in R^{\frac{d (d + 1)}{2} \times \frac{d (d + 1)}{2}},

where $R_{i}$ is the $i$th row of matrix Ξ. Furthermore, with a slight abuse of notation, let $M \in R^{d (d + 1) / 2}$ also denote the vector consisting of the entries $M_{11}, \dots, M_{d d}$ and $2 M_{i j}, 1 \leq i < j \leq d$. With this notation, if $v = Ξ M \in R^{N \times 1}$, then we have

\hat{L} (W) = {‖ v ‖}_{2}^{2} / N \Rightarrow {‖ v ‖}_{2}^{2} \leq N κ,

because $\hat{L} (W) \leq κ$ by assumption.
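The identity $\hat{L} (W) = {‖ Ξ M ‖}_{2}^{2} / N$ can be checked numerically. The sketch below builds the feature rows $R_{i}$ and the vectorized M (diagonal entries, then doubled off-diagonal entries) for synthetic Gaussian data; the function names and problem sizes are hypothetical and chosen only for illustration.

```python
import numpy as np
from itertools import combinations

def feature_row(x):
    """R_i: squares x_k^2 followed by cross products x_k x_l for k < l."""
    d = len(x)
    squares = [x[k] ** 2 for k in range(d)]
    cross = [x[k] * x[l] for k, l in combinations(range(d), 2)]
    return np.array(squares + cross)

def vectorize_symmetric(M):
    """Vector of M_kk followed by 2 M_kl for k < l (same ordering as feature_row)."""
    d = M.shape[0]
    diag = [M[k, k] for k in range(d)]
    off = [2.0 * M[k, l] for k, l in combinations(range(d), 2)]
    return np.array(diag + off)

rng = np.random.default_rng(1)
d, m, N = 4, 6, 30                       # illustrative sizes
W = rng.standard_normal((m, d))
W_star = rng.standard_normal((m, d))     # hypothetical ground truth
M = W.T @ W - W_star.T @ W_star
X = rng.standard_normal((N, d))

Xi = np.stack([feature_row(x) for x in X])   # N x d(d+1)/2
v = Xi @ vectorize_symmetric(M)              # v_i = X_i^T M X_i

risk_direct = np.mean(np.einsum("ni,ij,nj->n", X, M, X) ** 2)
risk_features = v @ v / N
print(risk_direct, risk_features)
```

The factor 2 on the off-diagonal entries is what makes the inner product of $R_{i}$ with the vectorized M reproduce the full quadratic form $X_{i}^{T} M X_{i}$.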

Next, we have

M = {(Ξ^{T} Ξ)}^{- 1} Ξ^{T} v \Rightarrow {‖ M ‖}_{2}^{2} \leq {‖ {(Ξ^{T} Ξ)}^{- 1} ‖}_{2}^{2} {‖ Ξ^{T} v ‖}_{2}^{2} .

(19)

We start with the second term. Recall that ${‖ v ‖}_{2} \leq \sqrt{N κ}$ , and we condition on ${‖ X_{i} ‖}_{\infty} < d^{1 / 2}, 1 \leq i \leq N$ . Next, using the Cauchy–Schwarz inequality,

| {(Ξ^{T} v)}_{i} | \leq {‖ v ‖}_{2} \sqrt{N d} \leq N d^{1 / 2} \sqrt{κ} .

(20)

Hence,

{‖ Ξ^{T} v ‖}_{2}^{2} \leq N^{2} d^{3} κ .

(21)

We now control ${‖ {(Ξ^{T} Ξ)}^{- 1} ‖}_{2}^{2}$ . This is done in a manner similar to the proof of Emschwiller et al. [33, theorem 3.2]. The main tool is the result Theorem 17 for concentration of the spectrum of random matrices with i.i.d. nonisotropic rows. The parameter setting we operate under is provided as follows.

Parameter	Value
m	$d^{2}$
t	$N^{1 / 8}$
δ	$N^{- 3 / 8} d$
γ	$\max ({‖ Σ ‖}^{1 / 2} δ, δ^{2})$

Start by verifying that, because we condition on ${‖ X_{i} ‖}_{\infty} < d^{1 / 2}$, it is indeed the case that the $ℓ_{2}$-norm of each row of Ξ is at most d; thus, the preceding value of m works.

We now claim $γ = {‖ Σ ‖}^{1 / 2} δ$ . To prove this, it suffices to show

N > {‖ Σ ‖}^{- 4 / 3} d^{\frac{8}{3}} .

Using Emschwiller et al. [33, theorem 5.1] (also see Remark 2) with k = 2, we obtain $σ_{min} (Σ) \geq c d^{- 4}$ for some absolute constant c > 0 depending only on the data coordinate distribution. Consequently,

{‖ Σ ‖}^{- 4 / 3} \leq σ_{min} {(Σ)}^{- 4 / 3} \leq c^{- 4 / 3} d^{16 / 3} \Rightarrow {‖ Σ ‖}^{- 4 / 3} d^{\frac{8}{3}} < c^{- 4 / 3} d^{8},

which is below the sample size N, as required. Therefore, $γ = {‖ Σ ‖}^{1 / 2} δ$.

We now claim

\frac{1}{2} σ_{min} (Σ) > γ = {‖ Σ ‖}^{1 / 2} N^{- \frac{3}{8}} d .

This is equivalent to establishing

N > 2^{8 / 3} \frac{{‖ Σ ‖}^{4 / 3} d^{\frac{8}{3}}}{σ_{min} {(Σ)}^{8 / 3}} .

Using again Emschwiller et al. [33, theorem 5.1], we have $‖ Σ ‖ < f d^{4}$ for some absolute constant f > 0. This yields

2^{8 / 3} \frac{{‖ Σ ‖}^{4 / 3} d^{\frac{8}{3}}}{σ_{min} {(Σ)}^{8 / 3}} < C' d^{\frac{56}{3}}

for some absolute constant $C' > 0$, which again holds in our case as $N > d^{18 + \frac{4}{3}}$.
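The exponent bookkeeping behind the two sample-size requirements can be verified with exact rational arithmetic. The sketch below only tracks powers of d, using the bounds $‖ Σ ‖ < f d^{4}$ and $σ_{min} (Σ) \geq c d^{- 4}$ quoted above; constants are ignored, and the variable names are illustrative.

```python
from fractions import Fraction as F

norm_exp = F(4)         # ||Sigma|| < f d^4
sigma_min_exp = F(-4)   # sigma_min(Sigma) >= c d^{-4}

# First requirement: N > ||Sigma||^{-4/3} d^{8/3}, bounded via ||Sigma|| >= sigma_min(Sigma).
first_exp = F(-4, 3) * sigma_min_exp + F(8, 3)          # 16/3 + 8/3 = 8

# Second requirement: N > 2^{8/3} ||Sigma||^{4/3} d^{8/3} / sigma_min(Sigma)^{8/3}.
second_exp = F(4, 3) * norm_exp + F(8, 3) + F(-8, 3) * sigma_min_exp  # 56/3

sample_exp = F(18) + F(4, 3)                            # N > d^{18 + 4/3} = d^{58/3}

print(first_exp, second_exp, sample_exp)
```

Because $58 / 3 > 56 / 3 > 8$, the stated sample size dominates both requirements in the exponent, so both hold once d is large enough to absorb the constants.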

The rest is verbatim from Emschwiller et al. [33]. We now apply Theorem 17. With probability at least $1 - d^{2} \exp (- c N^{1 / 4})$ (here, c > 0 is an absolute constant), it holds that

‖ \frac{1}{N} Ξ^{T} Ξ - Σ ‖ \leq γ .

(22)

Now, for $D = d (d + 1) / 2$ ,

‖ \frac{1}{N} Ξ^{T} Ξ - Σ ‖ \leq γ \Leftrightarrow \forall v \in R^{D}, | {‖ \frac{1}{\sqrt{N}} Ξ v ‖}_{2}^{2} - v^{T} Σ v | \leq γ {‖ v ‖}_{2}^{2},

which implies, for every v on the sphere $S^{D - 1} = \{v \in R^{D} : {‖ v ‖}_{2} = 1\}$,

\frac{1}{N} {‖ Ξ v ‖}_{2}^{2} \geq v^{T} Σ v - γ \Rightarrow \frac{1}{N} \inf_{v : ‖ v ‖ = 1} {‖ Ξ v ‖}_{2}^{2} \geq \inf_{v : ‖ v ‖ = 1} v^{T} Σ v - γ .

Now, using the Courant–Fischer variational characterization of the smallest singular value (Horn and Johnson [47]), we obtain

σ_{min} (Ξ^{T} Ξ) \geq N (σ_{min} (Σ) - γ) > \frac{N}{2} σ_{min} (Σ),

(23)

with probability at least $1 - \exp (- c' N^{1 / 4})$, where $c' > 0$ is a positive absolute constant smaller than c.
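The step from the concentration bound (22) to the eigenvalue bound is deterministic: $\inf_{‖ v ‖ = 1} {‖ Ξ v ‖}_{2}^{2}$ is the smallest eigenvalue of $Ξ^{T} Ξ$, which is at least $N (σ_{min} (Σ) - γ)$ whenever $‖ N^{- 1} Ξ^{T} Ξ - Σ ‖ \leq γ$. The sketch below illustrates this with standard Gaussian coordinates (an assumption made only so that Σ has a simple closed form); here γ is taken to be the realized deviation rather than the value in the parameter table.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)
d, N = 3, 5000                               # illustrative sizes
D = d * (d + 1) // 2
pairs = list(combinations(range(d), 2))

X = rng.standard_normal((N, d))
# Rows R_i = (squares, cross products), as in the definition of Xi.
Xi = np.hstack([X ** 2] + [(X[:, k] * X[:, l])[:, None] for k, l in pairs])

# Exact Sigma = E[R R^T] for standard Gaussian coordinates:
# squares block is (all-ones + 2 I), cross-product block is I, mixed terms vanish.
Sigma = np.eye(D)
Sigma[:d, :d] = np.ones((d, d)) + 2.0 * np.eye(d)

gram = Xi.T @ Xi
gamma = np.linalg.norm(gram / N - Sigma, 2)      # realized spectral deviation
lam_min_gram = np.linalg.eigvalsh(gram)[0]       # smallest eigenvalue of Xi^T Xi
lam_min_sigma = np.linalg.eigvalsh(Sigma)[0]
lower_bound = N * (lam_min_sigma - gamma)
sigma_min_Xi = np.linalg.svd(Xi, compute_uv=False)[-1]

print(lam_min_gram, lower_bound, sigma_min_Xi ** 2)
```

Note that the quantity being bounded is the smallest eigenvalue of $Ξ^{T} Ξ$, that is, the square of the smallest singular value of Ξ.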

We now return to (19) to specifically bound $‖ {(Ξ^{T} Ξ)}^{- 1} ‖$. Let A be any invertible matrix. Note that $‖ A^{- 1} ‖ = σ_{min} {(A)}^{- 1}$. Indeed, taking the singular value decomposition $A = U Σ V^{T}$ and observing $A^{- 1} = {(V^{T})}^{- 1} Σ^{- 1} U^{- 1}$, we obtain $‖ A^{- 1} ‖ = \max_{i} {(σ_{i} (A))}^{- 1} = σ_{min} {(A)}^{- 1}$. This, together with (23), yields

‖ {(Ξ^{T} Ξ)}^{- 1} ‖ \leq \frac{2}{N σ_{min} (Σ)},

(24)

with probability at least $1 - \exp (- c' N^{1 / 4})$.
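The linear algebra fact used for (24), namely that the spectral norm of an inverse equals the reciprocal of the smallest singular value, is easy to confirm numerically. The following is a minimal sketch on a random Gram matrix; the matrix sizes and the Gaussian entries are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
Xi = rng.standard_normal((50, 6))     # stand-in for the feature matrix
gram = Xi.T @ Xi                      # invertible with probability one

inv_norm = np.linalg.norm(np.linalg.inv(gram), 2)            # ||(Xi^T Xi)^{-1}||
sigma_min_gram = np.linalg.svd(gram, compute_uv=False)[-1]   # sigma_min(Xi^T Xi)
sigma_min_Xi = np.linalg.svd(Xi, compute_uv=False)[-1]       # sigma_min(Xi)

print(inv_norm, 1.0 / sigma_min_gram, sigma_min_Xi ** 2)
```

In particular, a lower bound on the smallest singular value of $Ξ^{T} Ξ$ translates directly into an upper bound on $‖ {(Ξ^{T} Ξ)}^{- 1} ‖$.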

We now have all ingredients to execute the bound in (19). Combining Equations (21) and (24), we get

\begin{array}{rl} M = {(Ξ^{T} Ξ)}^{- 1} Ξ^{T} v \Rightarrow {‖ M ‖}_{2}^{2} & \leq {‖ {(Ξ^{T} Ξ)}^{- 1} ‖}_{2}^{2} \cdot {‖ Ξ^{T} v ‖}_{2}^{2} \\ & \leq \underbrace{\frac{4}{N^{2} σ_{min} {(Σ)}^{2}}}_{\text{from (24)}} \cdot \underbrace{N^{2} d^{3} κ}_{\text{from (21)}} \\ & = 4 κ σ_{min} {(Σ)}^{- 2} d^{3} \leq 4 C κ d^{11}, \end{array}

for some constant C > 0. Using (18) from Part (b), we have that κ can be taken to be

32 ζ σ_{min} {(W^{*})}^{- 2} d^{4 K + 4}

with probability at least

1 - \exp (- C' d) - {(9 d^{4 K + 9})}^{d^{2} - 1} \cdot \exp (- C_{4} N d^{- 4 - 4 K}) - N d \exp (- C d) .

Because ${‖ M ‖}_{2}^{2} \leq 4 C κ d^{11}$ with probability at least $1 - \exp (- c' N^{1 / 4})$ , we have that

{‖ M ‖}_{2} \leq C' \sqrt{ζ} d^{15 / 2 + 2 K} σ_{min} {(W^{*})}^{- 1}

with probability at least

1 - \exp (- c' N^{1 / 4}) - {(9 d^{4 K + 9})}^{d^{2} - 1} \cdot \exp (- C_{4} N d^{- 4 - 4 K}) - N d \exp (- C d),

by the union bound. As ${‖ M ‖}_{F} \leq {‖ M ‖}_{2}$ and $\sqrt{ζ} \leq \frac{ϵ}{C'} d^{- 15 / 2 - 2 K} σ_{min} (W^{*})$ per (4), we arrive at ${‖ W^{T} W - {(W^{*})}^{T} W^{*} ‖}_{F} \leq ϵ$, as claimed.

We now show the generalization ability. For any $W \in R^{m \times d}$ , using auxiliary result Theorem 12(c), we have

L (W) \leq μ_{2}^{2} \cdot trace {(M)}^{2} + \max \{μ_{4} - μ_{2}^{2}, 2 μ_{2}^{2}\} \cdot trace (M^{2}),

where $M = W^{T} W - {(W^{*})}^{T} W^{*} \in R^{d \times d}$. Now, note that

trace {(M)}^{2} = | \sum_{1 \leq i \leq d} M_{i i} |^{2} \leq d \sum_{1 \leq i \leq d} M_{i i}^{2} \leq d {‖ M ‖}_{F}^{2}

by Cauchy–Schwarz. Furthermore, because M is symmetric, $trace (M^{2}) = trace (M^{T} M) = {‖ M ‖}_{F}^{2}$. Thus,

L (W) \leq {‖ M ‖}_{F}^{2} (d μ_{2}^{2} + \max {μ_{4} - μ_{2}^{2}, 2 μ_{2}^{2}}) \leq 2 d μ_{2}^{2} {‖ M ‖}_{F}^{2},

for d large. Because ${‖ M ‖}_{F}^{2} \leq {‖ M ‖}_{2}^{2} \leq {(C')}^{2} ζ d^{15 + 4 K} σ_{min} {(W^{*})}^{- 2}$, we obtain

L (W) \leq ζ \cdot 2 {(C')}^{2} μ_{2}^{2} d^{16 + 4 K} σ_{min} {(W^{*})}^{- 2} .

Finally, because $ζ \leq \frac{ϵ}{2 {(C')}^{2} μ_{2}^{2}} d^{- 16 - 4 K} σ_{min} {(W^{*})}^{2}$ per (4), we conclude the proof of the generalization bound, that is, $L (W) \leq ϵ$.
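The trace bounds used in this generalization argument can be illustrated exactly on a toy distribution. The sketch below assumes i.i.d. coordinates uniform on $\{- 1, 0, 1\}$ (a hypothetical choice, with $μ_{2} = μ_{4} = 2 / 3$) and an arbitrary symmetric matrix M; it computes $E [{(X^{T} M X)}^{2}]$ by enumeration and compares it with the two-sided bound that the text attributes to Theorem 12(c), as well as with $trace {(M)}^{2} \leq d {‖ M ‖}_{F}^{2}$.

```python
from fractions import Fraction
from itertools import product

support = [-1, 0, 1]                       # hypothetical coordinate distribution
d = 3
M = [[2, -1, 0], [-1, 1, 3], [0, 3, -2]]   # arbitrary symmetric matrix

mu2 = Fraction(sum(s ** 2 for s in support), len(support))   # second moment
mu4 = Fraction(sum(s ** 4 for s in support), len(support))   # fourth moment

def quad_form(x):
    """x^T M x for an integer vector x."""
    return sum(x[i] * M[i][j] * x[j] for i in range(d) for j in range(d))

# Exact population risk E[(X^T M X)^2] by enumerating all 3^d equally likely points.
points = list(product(support, repeat=d))
risk = Fraction(sum(quad_form(x) ** 2 for x in points), len(points))

trace = sum(M[i][i] for i in range(d))
frob_sq = sum(M[i][j] ** 2 for i in range(d) for j in range(d))  # trace(M^2) for symmetric M
diag_sq = sum(M[i][i] ** 2 for i in range(d))

upper = mu2 ** 2 * trace ** 2 + max(mu4 - mu2 ** 2, 2 * mu2 ** 2) * frob_sq
lower = mu2 ** 2 * trace ** 2 + min(mu4 - mu2 ** 2, 2 * mu2 ** 2) * frob_sq
# Direct moment expansion for i.i.d. centered coordinates.
closed_form = (mu2 ** 2 * trace ** 2 + (mu4 - mu2 ** 2) * diag_sq
               + 2 * mu2 ** 2 * (frob_sq - diag_sq))

print(lower, risk, upper)
```

The exact risk sits between the lower and upper bounds, which differ only in whether the minimum or the maximum of $μ_{4} - μ_{2}^{2}$ and $2 μ_{2}^{2}$ multiplies $trace (M^{2})$.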

Remark 2.

The argument presented uses Emschwiller et al. [33, theorem 5.1]. Even though that result is stated for distributions supported on ${[- 1, 1]}^{d}$ , it still applies under the weaker assumption that the distribution has finite moments of all orders; see Emschwiller et al. [33, remark 5.5].

4.11.4. Case of Constant d: $d = O (1)$ .

We provide a very brief sketch of the argument in the case $d = O (1)$. The argument is quite similar to the one in Theorem 2. As in the analysis of the $d = O (1)$ case conducted for Theorem 2, we use the fact that, if X is a sub-Gaussian random variable, then $E {[| X |^{p}]}^{1 / p} = O (\sqrt{p})$ for every $p \geq 1$, and in particular, $E [| X |^{p}] < \infty$ for all $p \geq 1$; see Vershynin [91, lemma 5.5] for a more precise statement.

Next, the upper bound on the energy value now modifies to $\frac{1}{2} \bar{C_{5}} σ_{min} {(W^{*})}^{4}$ .

4.11.5. Part (a).

Note that part (a) for the case of general d follows from earlier Claim 1. For the case when d is constant, part (a) now follows from modified Claim 1, provided under Theorem 2 for the case $d = O (1)$ . That is, for the event

E_{1} ≜ {\sup_{W : \hat{L} (W) \leq \frac{1}{2} \bar{C_{5}} σ_{min} {(W^{*})}^{4}} {‖ W ‖}_{F} = O (1)},

it is the case that $P (E_{1}) \geq 1 - O (1 / N)$, where $\bar{C_{5}}$ is the constant appearing in Theorem 2.

Furthermore,

‖ \nabla^{2} \hat{L} (W) ‖ = Poly ({‖ W ‖}_{F}, \frac{1}{N} \sum_{1 \leq i \leq N} {‖ X_{i} ‖}_{2}^{D})

for some absolute constant D > 0, and for any constant D > 0,

\frac{1}{N} \sum_{1 \leq i \leq N} {‖ X_{i} ‖}_{2}^{D} = O (1)

with probability at least $1 - O (1 / N)$ (where we use, in particular, the fact $E [X_{i} {(j)}^{2 D}] < \infty$). Combining these, we find that

L ≜ \sup {‖ \nabla^{2} \hat{L} (W) ‖ : \hat{L} (W) \leq \frac{1}{2} \bar{C_{5}} σ_{min} {(W^{*})}^{4}} = O (1)

with probability at least $1 - O (1 / N)$.

4.11.6. Part (b).

The analysis for time horizon T remains intact. Furthermore, the entire analysis leading to (15) remains (nearly) intact, and this equation now modifies to

\hat{L} (W) \leq O (1) \cdot {‖ \frac{1}{N} \sum_{1 \leq i \leq N} r_{i} X_{i} X_{i}^{T} ‖}_{F},

because the Frobenius norm terms involved (all of which are polynomials in d) are now O(1). The event $E_{3}$ appearing in (16) modifies now to

E_{3} ≜ {\inf_{\begin{matrix} W \in R^{m \times d} : σ_{min} (W) < \frac{1}{2} σ_{min} (W^{*}) \\ {‖ W ‖}_{F} \leq d^{K + 1} \end{matrix}} \hat{L} (W) \geq \frac{1}{2} \bar{C_{5}} σ_{min} {(W^{*})}^{4}},

which holds with probability at least $1 - O (1 / N)$ (the modifications are exactly the same as those noted in Theorem 2 for the case $d = O (1)$). The rest of the analysis in Part (b) is exactly the same: combining the modified versions of the events $E_{1}$ and $E_{3}$ via a union bound (note that there is no need to incorporate the event $E_{2}$, which, for the case of general d, is required for truncation) and recalling Equation (4) on ζ, it follows that, with probability at least $1 - O (1 / N)$, $\hat{L} (W) \leq ϵ$.

4.11.7. Part (c).

The modification for this part is as follows. First, we do not condition on $E_{2}$ as earlier. Instead, we apply Chebyshev’s inequality (elaborated subsequently). The entire analysis leading up to (20) remains the same. Note now that $Ξ^{T} v \in R^{d (d + 1) / 2}$ . For each coordinate ${(Ξ^{T} v)}_{i}$ of this vector, we have, using Chebyshev’s inequality,

| {(Ξ^{T} v)}_{i} | \leq O (\sqrt{N}) \cdot {‖ v ‖}_{2} \leq N \cdot \sqrt{κ} \cdot O (1),

with probability $1 - O (1 / N)$. Taking a union bound over $d (d + 1) / 2 = O (1)$ coordinates yields that, with probability $1 - O (1 / N)$, this remains true over all $1 \leq i \leq d (d + 1) / 2$.

Next, to control ${‖ {(Ξ^{T} Ξ)}^{- 1} ‖}_{2}^{2}$, we do not need a delicate concentration result (such as Theorem 17) as earlier. Instead, we take the following route.

Fix $ϵ > 0$ to be tuned. Recall the notation $R_{i}$ from earlier, where $R_{i}$ is the $i$th row of matrix $Ξ \in R^{N \times d (d + 1) / 2}$, and recall that $R_{i}, 1 \leq i \leq N$, are i.i.d. random vectors. Using the outer product representation of matrix multiplication as before, we have

\frac{1}{N} Ξ^{T} Ξ = \frac{1}{N} \sum_{1 \leq i \leq N} R_{i} R_{i}^{T} \in R^{\frac{d (d + 1)}{2} \times \frac{d (d + 1)}{2}} .

Consequently,

E [\frac{1}{N} Ξ^{T} Ξ] = E [R_{i} R_{i}^{T}] = Σ \in R^{\frac{d (d + 1)}{2} \times \frac{d (d + 1)}{2}},

where $E [\cdot]$ acts entry-wise.

Because $d = O (1)$ , a simple application of Chebyshev’s inequality together with a union bound over $Θ (d^{4})$ entries (of $N^{- 1} Ξ^{T} Ξ$ ) yields

\max_{1 \leq i, j \leq d (d + 1) / 2} | {(\frac{1}{N} Ξ^{T} Ξ - Σ)}_{i j} | \leq ϵ

with probability at least $1 - O (1 / N)$. Using now $‖ M ‖ \leq {‖ M ‖}_{F}$, valid for any matrix M, we obtain

‖ \frac{1}{N} Ξ^{T} Ξ - Σ ‖ \leq ϵ d^{2}

with probability $1 - O (1 / N)$. Similar to earlier, $σ_{min} (Σ) = Ω (1)$. Furthermore, the analysis starting from (22) and leading to (23) remains intact (with γ replaced with $ϵ d^{2}$, where $ϵ > 0$ is to be tuned). In particular, for ϵ sufficiently small, it is the case that, with probability at least $1 - O (1 / N)$,

σ_{min} (Ξ^{T} Ξ) > \frac{N}{2} σ_{min} (Σ) .

The rest of the analysis remains intact except that the probability bounds are now modified to $1 - O (1 / N)$ . Finally, recalling the bound (4) on ζ, we find that, with probability $1 - O (1 / N)$ ,

{‖ W^{T} W - {(W^{*})}^{T} W^{*} ‖}_{F} \leq ϵ and L (W) \leq ϵ .

4.12. Proof of Theorem 8

Let $W_{0}^{T} W_{0} = m I_{d}$ , and let ${λ_{1}, \dots, λ_{d}} = σ ({(W^{*})}^{T} W^{*} - m I_{d})$ . In what follows, recall the quantities from the proof of Theorem 7(b): $σ_{*} ≜ Var ({(W_{i j}^{*})}^{2} - 1), χ_{2} ≜ \int x^{2} d ω (x)$ , where $ω (x)$ is the semicircle law. Fix now an arbitrary $ϵ > 0$ and a K > 0.

We start by defining several auxiliary events:

\begin{array}{rl} E_{1} & ≜ \{\sum_{1 \leq i \leq d} λ_{i}^{2} < 4 (1 + o (1)) m d^{2} χ_{2}\}, \\ E_{2} & ≜ \{| \sum_{1 \leq i \leq d} λ_{i} | < σ_{*} \sqrt{m d} d^{ϵ}\}, \\ E_{3} & ≜ \{σ_{min} {(W^{*})}^{4} \geq \frac{1}{16} m^{2}\}, \\ E_{4} & ≜ \{{‖ X_{i} ‖}_{\infty} \leq d^{1 / 2}, 1 \leq i \leq N\} . \end{array}

Note from the proof of Theorem 7(b) that $P (E_{i}) \geq 1 - o_{d} (1)$ for i = 1, 2, 3, and from the union bound and the sub-Gaussianity of X, $P (E_{4}) \geq 1 - N \exp (- C d)$. Thus,

P (\underset{1 \leq i \leq 4}{\cap} E_{i}) \geq 1 - o_{d} (1) - N \exp (- C d) .

In what follows, suppose we condition on the event $\cap_{1 \leq i \leq 4} E_{i}$ . Note that, in this conditional universe, it is still the case that X_i, $1 \leq i \leq N$ are i.i.d. random vectors with centered i.i.d. coordinates. Using now Hölder’s inequality (Theorem 18) with $p = 1, q = \infty, U = X_{i} X_{i}^{T}$ , and $V = {(W^{*})}^{T} W^{*} - m I_{d}$ , we arrive at

\begin{array}{rl} | X_{i}^{T} ({(W^{*})}^{T} W^{*} - m I_{d}) X_{i} | & = | \langle X_{i} X_{i}^{T}, {(W^{*})}^{T} W^{*} - m I_{d} \rangle | \\ & \leq ‖ {(W^{*})}^{T} W^{*} - m I_{d} ‖ \cdot trace (X_{i} X_{i}^{T}) \\ & \leq 2 \sqrt{m d} d^{2}, \end{array}

where we use the fact that $trace (X_{i} X_{i}^{T}) = {‖ X_{i} ‖}_{2}^{2} \leq d^{2}$ (recall the conditioning on $E_{4}$). Using Hoeffding’s inequality, we have

\hat{L} (W_{0}) = \frac{1}{N} \sum_{1 \leq i \leq N} (X_{i}^{T} {(W^{*})}^{T} W^{*} X_{i} - X_{i}^{T} W_{0}^{T} W_{0} X_{i})^{2} \leq \frac{3}{2} L (W_{0}),

with probability at least

1 - \exp (- C' N d^{- 5} m^{- 1}),

where

L (W_{0}) = E [{(X^{T} {(W^{*})}^{T} W^{*} X - X^{T} W_{0}^{T} W_{0} X)}^{2} | {‖ X ‖}_{\infty} \leq d^{1 / 2}] .

Namely, $L (W_{0})$ is the population risk in the conditional universe.

Next, in this conditional space, using Theorem 12(c), we arrive at

L (W_{0}) \leq μ_{2} {(1 / 2)}^{2} {| \sum_{1 \leq i \leq d} λ_{i} |}^{2} + \max {μ_{4} (1 / 2) - μ_{2} {(1 / 2)}^{2}, 2 μ_{2} {(1 / 2)}^{2}} (\sum_{1 \leq i \leq d} λ_{i}^{2}) .

Finally, carrying out the same analysis as in the end of the proof of Theorem 7, we deduce

\hat{L} (W_{0}) < \frac{1}{2} C_{5} σ_{min} {(W^{*})}^{4},

provided $m > C^{″} d^{2}$ for a large enough constant $C^{″}$, namely, provided that the network is sufficiently overparameterized.

4.13. Proof of Theorem 9

Let $span (X_{i} X_{i}^{T} : i \in [N]) = S$, the set of all d × d symmetric matrices, and let $M \in S$ be such that, for any i, $X_{i}^{T} M X_{i} = 0$. We establish M = 0. To that end, let $1 \leq k, ℓ \leq d$ be two fixed indices, and let $θ_{i}^{(k, ℓ)} \in R$ be such that $\sum_{i = 1}^{N} θ_{i}^{(k, ℓ)} X_{i} X_{i}^{T} = e_{k} e_{ℓ}^{T} + e_{ℓ} e_{k}^{T}$, where the column vectors $e_{k}, e_{ℓ} \in R^{d}$ are, respectively, the kth and $ℓ$th elements of the standard basis for $R^{d}$. Such $θ_{i}^{(k, ℓ)}$ indeed exist because of the spanning property. Observe that $2 M_{k, ℓ} = e_{k}^{T} M e_{ℓ} + e_{ℓ}^{T} M e_{k} = tr (e_{k}^{T} M e_{ℓ} + e_{ℓ}^{T} M e_{k})$. Now, using the fact that $tr (ABC) = tr (BCA) = tr (CAB)$ for all matrices A, B, C (with matching dimensions), we have
$2 M_{k, ℓ} = tr (M e_{ℓ} e_{k}^{T} + M e_{k} e_{ℓ}^{T}) = tr (\sum_{i = 1}^{N} θ_{i}^{(k, ℓ)} M X_{i} X_{i}^{T}) = \sum_{i = 1}^{N} θ_{i}^{(k, ℓ)} tr (X_{i}^{T} M X_{i}) = 0,$
for every $k, ℓ \in [d]$ . Finally, if W is such that $\hat{L} (W) = 0$ , then $X_{i}^{T} M X_{i} = 0$ for any i, where $M = {(W^{*})}^{T} W^{*} - W^{T} W$ . Hence, provided that the geometric condition holds, we have M = 0, that is, $W^{T} W = {(W^{*})}^{T} W^{*}$ . From here, the final conclusion follows per Theorem 14. Because $W^{T} W = {(W^{*})}^{T} W^{*}$ , W clearly has zero generalization error, that is, $L (W) = 0$ .

We now turn to part (b). Our goal is to construct a $W \in R^{m \times d}$ with $f (W^{*}; X_{i}) = f (W; X_{i})$ for every $i \in [N]$, whereas $W^{T} W \neq {(W^{*})}^{T} W^{*}$. Consider the inner product $\langle A, B \rangle = trace (A B)$ in the space of all symmetric d × d matrices. Find a symmetric matrix $0 \neq M \in R^{d \times d}$ such that $M \in {span}^{⊥} (X_{i} X_{i}^{T} : i \in [N])$, that is, $X_{i}^{T} M X_{i} = 0$ for every $i \in [N]$. We can find such M satisfying ${‖ M ‖}_{2} = 1$. Consider the linear matrix function $M (δ) = {(W^{*})}^{T} W^{*} + δ M$. Note that $M (δ)$ is symmetric for every δ.

We claim that, under the hypothesis of the theorem, there exists a $δ_{0} > 0$ such that $M (δ)$ is positive semidefinite for every $δ \in [0, δ_{0}]$ and that there exists $W_{δ} \in R^{m \times d}$ with $W_{δ}^{T} W_{δ} = M (δ)$ for all $δ \in [0, δ_{0}]$. Observe that, because $rank (W^{*}) = d$, the matrix ${(W^{*})}^{T} W^{*} \in R^{d \times d}$ satisfies $rank ({(W^{*})}^{T} W^{*}) = d$. Therefore, the eigenvalues $λ_{1}^{*}, \dots, λ_{d}^{*}$ of ${(W^{*})}^{T} W^{*}$ are all positive. In particular, $\{λ_{i}^{*} : i \in [d]\} \subset [δ_{1}, \infty)$ with $δ_{1} = σ_{min} {(W^{*})}^{2}$. Now, let $μ_{1} (δ), \dots, μ_{d} (δ)$ be the eigenvalues of $M (δ)$. Using Weyl’s inequality (Horn and Johnson [47]), we have $| μ_{i} (δ) - λ_{i}^{*} | \leq δ {‖ M ‖}_{2} = δ$ for every i. In particular, taking $δ \leq δ_{1}$, we deduce that, for every $i \in [d]$, $μ_{i} (δ) \geq λ_{i}^{*} - δ_{1} \geq 0$, that is, $\{μ_{i} (δ) : i \in [d]\} \subset [0, \infty)$. Because $M (δ)$ is also symmetric, it is positive semidefinite. Thus, there exists a $\bar{W_{δ}} \in R^{d \times d}$ such that ${\bar{W_{δ}}}^{T} \bar{W_{δ}} = M (δ)$.
Now, using the same idea as in the proof of Theorem 1(c), we then deduce that, for any $\hat{m} \geq d$ , there exists a matrix $W_{δ} \in R^{\hat{m} \times d}$ such that $W_{δ}^{T} W_{δ} = {\bar{W_{δ}}}^{T} \bar{W_{δ}} = M (δ)$ . In particular, for this $W_{δ}$ , if $f (W_{δ}, X)$ is the function computed by the neural network with weight matrix $W_{δ} \in R^{\hat{m} \times d}$ , then on the training data, $(X_{i} : i \in [N]), f (W_{δ}; X_{i}) = X_{i}^{T} W_{δ}^{T} W_{δ} X_{i} = X_{i}^{T} {(W^{*})}^{T} W^{*} X_{i} = f (W^{*}; X_{i})$ because $X_{i}^{T} M X_{i} = 0$ for all $i \in [N]$ . At the same time, $W_{δ}^{T} W_{δ} - {(W^{*})}^{T} W^{*} = δ M \neq 0$ because $δ \neq 0$ and $M \neq 0$ , and therefore, $W_{δ}^{T} W_{δ} \neq {(W^{*})}^{T} W^{*}$ .

Finally, to show $L (W_{δ}) > 0$, we argue as follows. Suppose $L (W_{δ}) = 0$. Then, by Theorem 13, it follows that $ψ (X) = X^{T} A X = 0$ identically, where $A = W_{δ}^{T} W_{δ} - {(W^{*})}^{T} W^{*}$. Now, letting $ξ_{1}, \dots, ξ_{d}$ be the eigenvectors of A (with corresponding eigenvalues $λ_{1}, \dots, λ_{d}$), we obtain $ξ_{i}^{T} A ξ_{i} = λ_{i} ξ_{i}^{T} ξ_{i} = λ_{i} {‖ ξ_{i} ‖}_{2}^{2} = 0$; that is, $λ_{i} = 0$ for every $i \in [d]$. Finally, because A is symmetric and, hence, admits a diagonalization of the form $A = Q Λ Q^{T}$ with the diagonal entries of Λ being zero, we deduce that A is identically zero, which contradicts the fact that $A = δ M$ is a nonzero matrix.
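The construction in this proof can be reproduced numerically when the training set is one sample short of spanning S. The sketch below is only an illustration under stated assumptions: Gaussian data, a random full-rank $W^{*}$, the choice $δ = δ_{1} / 2$ (so that a Cholesky factor exists), and zero-padding to reach $\hat{m}$ rows; none of these specific choices come from the paper.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(4)
d, m_hat = 3, 5
D = d * (d + 1) // 2
N = D - 1                                   # one sample short of spanning S
pairs = list(combinations(range(d), 2))

W_star = rng.standard_normal((m_hat, d))    # hypothetical full-rank ground truth
X = rng.standard_normal((N, d))

# Feature rows (squares, cross products); a null vector c gives a symmetric M
# with X_i^T M X_i = 0 for every training point.
features = np.hstack([X ** 2] + [(X[:, k] * X[:, l])[:, None] for k, l in pairs])
c = np.linalg.svd(features)[2][-1]          # right singular vector for singular value 0
M = np.diag(c[:d])
for idx, (k, l) in enumerate(pairs):
    M[k, l] = M[l, k] = c[d + idx] / 2.0
M /= np.linalg.norm(M, 2)                   # normalize so that ||M||_2 = 1

G_star = W_star.T @ W_star
delta_1 = np.linalg.eigvalsh(G_star)[0]     # sigma_min(W*)^2
delta = 0.5 * delta_1                       # strictly inside [0, delta_1]
M_delta = G_star + delta * M                # positive definite by Weyl's inequality

L = np.linalg.cholesky(M_delta)             # M_delta = L L^T
W_delta = np.vstack([L.T, np.zeros((m_hat - d, d))])   # pad with zero rows

def f(W, x):
    """Output of the quadratic network: ||W x||^2."""
    return np.sum((x @ W.T) ** 2, axis=-1)

train_gap = np.max(np.abs(f(W_delta, X) - f(W_star, X)))
eigvals, eigvecs = np.linalg.eigh(M)
x_new = eigvecs[:, np.argmax(np.abs(eigvals))]          # unit vector with |x^T M x| = 1
test_gap = abs(f(W_delta, x_new) - f(W_star, x_new))

print(train_gap, test_gap, delta)
```

The two networks agree on every training point up to rounding, yet their outputs differ by δ along the top eigenvector of M, which is the mechanism behind the failure of generalization below the spanning threshold.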

4.14. Proof of Theorem 10

Recall that $S = {M \in R^{d \times d} : M^{T} = M}$ . Note that this space has dimension $(\begin{matrix} d \\ 2 \end{matrix}) + d$ . For any $1 \leq k \leq ℓ \leq d$ , it is easy to see that the matrices $e_{k} e_{ℓ}^{T} + e_{ℓ} e_{k}^{T}$ are linearly independent, and there are precisely $(\begin{matrix} d \\ 2 \end{matrix}) + d$ such matrices. With this in mind, the statement of part (b) is immediate.

We now prove part (a) of the theorem. For any $X_{i}$, let $X_{i} (j)$ be the jth coordinate of $X_{i}$ with $j \in [d]$, and let $Y_{i}$ be a $d (d + 1) / 2$-dimensional vector obtained by retaining $X_{i} {(1)}^{2}, \dots, X_{i} {(d)}^{2}$ and the products $X_{i} (k) X_{i} (ℓ)$ with $1 \leq k < ℓ \leq d$. Now, let $X$ be an $n \times d (d + 1) / 2$ matrix whose rows are $Y_{1}, \dots, Y_{n}$. Our goal is to establish

P [det (X) = 0] = 0,

when $n = d (d + 1) / 2$, where the probability is taken with respect to the randomness in $X_{1}, \dots, X_{n}$ (in particular, this yields that, for $n \geq d (d + 1) / 2$, $rank (X) = d (d + 1) / 2$ almost surely). Now, recalling Theorem 13, it then suffices to show that $det (X)$ is not identically zero when viewed as a polynomial in $X_{i} (j)$ with $i \in [N], j \in [d]$.

We now prove this by providing a deterministic construction (of the matrix $X$) under which $det (X) \neq 0$. Let $p_{1} < \dots < p_{d}$ be distinct prime numbers. For every $1 \leq t \leq N$, set

X_{t} = {(p_{1}^{t - 1}, \dots, p_{d}^{t - 1})}^{T} \in R^{d} .

In particular, $X_{1} = {(1, 1, \dots, 1)}^{T} \in R^{d}$ , which then implies that $Y_{1}$ is a vector of all ones. Now, we study $Y_{2}$ . The entries of $Y_{2}$ , called $z_{1}, \dots, z_{d (d + 1) / 2}$ , are of form $p_{i}^{2}$ with $i \in [d]$ or $p_{i} p_{j}$ , where $1 \leq i < j \leq d$ . By the fundamental theorem of arithmetic, we have $p_{i} p_{j} = p_{k} p_{ℓ} \Rightarrow {p_{i}, p_{j}} = {p_{k}, p_{ℓ}}$ , and therefore, $z_{1}, \dots, z_{d (d + 1) / 2}$ are pairwise distinct. With this construction, the matrix $X$ is a Vandermonde matrix with determinant

\prod_{1 \leq k < ℓ \leq d (d + 1) / 2} (z_{k} - z_{ℓ}) .

Because $z_{k} \neq z_{ℓ}$ for every $k \neq ℓ$ (from the construction on $Y_{2}$ , which, in turn, is constructed from X₂), this determinant is nonzero, proving the claim.
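The prime-power construction can be verified exactly for a small case. The sketch below takes d = 3 with the primes 2, 3, 5 (an illustrative choice), builds the $d (d + 1) / 2 = 6$ feature rows, and compares an exact determinant (computed by Gaussian elimination over the rationals, a helper written for this example) with the Vandermonde product over the entries of $Y_{2}$.

```python
from fractions import Fraction
from itertools import combinations

def quadratic_features(x):
    """Y: squares x_k^2 followed by cross products x_k x_l for k < l."""
    d = len(x)
    squares = [x[k] * x[k] for k in range(d)]
    cross = [x[k] * x[l] for k, l in combinations(range(d), 2)]
    return squares + cross

def exact_det(rows):
    """Determinant via Gaussian elimination in exact rational arithmetic."""
    a = [[Fraction(v) for v in row] for row in rows]
    n = len(a)
    det = Fraction(1)
    for c in range(n):
        pivot = next((r for r in range(c, n) if a[r][c] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != c:
            a[c], a[pivot] = a[pivot], a[c]
            det = -det                      # a row swap flips the sign
        det *= a[c][c]
        for r in range(c + 1, n):
            factor = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= factor * a[c][k]
    return det

primes = [2, 3, 5]
d = len(primes)
D = d * (d + 1) // 2
samples = [[p ** t for p in primes] for t in range(D)]     # X_{t+1} = (p_1^t, ..., p_d^t)
feature_matrix = [quadratic_features(x) for x in samples]  # row t+1 is Y_{t+1}

z = feature_matrix[1]                                      # entries of Y_2: [4, 9, 25, 6, 10, 15]
vandermonde_det = 1
for k, l in combinations(range(D), 2):
    vandermonde_det *= z[l] - z[k]

det = exact_det(feature_matrix)
print(det, vandermonde_det)
```

Each row of the feature matrix consists of the same nodes $z_{1}, \dots, z_{6}$ raised to a common power, so the matrix is Vandermonde in those nodes and its determinant is nonzero precisely because the nodes are pairwise distinct.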

4.15. Proof of Theorem 11

Note that, if $N \geq N^{*}$ , then, combining parts (a) of Theorems 9 and 10, we have that, with probability one, $span (X_{i} X_{i}^{T} : i \in [N]) = S$ , which, together with $\hat{L} (W) = 0$ , imply that
$P (E \neq Ø) = 0,$
where $E = {W \in R^{m \times d} : W^{T} W \neq {(W^{*})}^{T} W^{*}; \hat{L} (W) = 0}$ from which the desired conclusion follows.
Assume W is taken as in the proof of Theorem 9(b), that is,
$A = {(W^{*})}^{T} W^{*} - W^{T} W = δ M$, where $δ = σ_{min} {(W^{*})}^{2}$ and $‖ M ‖ = 1$,
with $M^{T} = M$. Let $\{λ_{1}, \dots, λ_{d}\}$ be the spectrum of the matrix δM. Using now Theorem 12(c), we have the lower bound
$\begin{array}{rl} L (W) & \geq E {[X_{i} {(j)}^{2}]}^{2} trace {(A)}^{2} + \min \{Var (X_{i} {(j)}^{2}), 2 E {[X_{i} {(j)}^{2}]}^{2}\} \cdot trace (A^{2}) \\ & \geq \min \{Var (X_{i} {(j)}^{2}), 2 E {[X_{i} {(j)}^{2}]}^{2}\} (\sum_{i = 1}^{d} λ_{i}^{2}) \\ & \geq \min \{Var (X_{i} {(j)}^{2}), 2 E {[X_{i} {(j)}^{2}]}^{2}\} λ_{max} {(δ M)}^{2}, \end{array}$
because $trace (A^{2}) = \sum_{i = 1}^{d} λ_{i}^{2}$ . Finally, because $λ_{max} {(δ M)}^{2} = δ^{2} = σ_{min} {(W^{*})}^{4}$ (as the spectral norm of M is one), we arrive at the desired conclusion.

Acknowledgments

The authors thank the anonymous reviewers for their very detailed feedback, which improved the presentation of this paper, and Orestis Plevrakis for providing useful remarks on the initial version of this paper. Part of this work was done when D. Gamarnik and E. C. Kızıldağ were visiting the Simons Institute for the Theory of Computing at the University of California, Berkeley in Fall 2020.

References

[1] Arora S, Ge R, Neyshabur B, Zhang Y (2018) Stronger generalization bounds for deep nets via a compression approach. Preprint, submitted February 14, https://arxiv.org/abs/1802.05296.
[2] Arora S, Du SS, Hu W, Li Z, Wang R (2019) Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. Preprint, submitted January 24, https://arxiv.org/abs/1901.08584.
[3] Bai ZD, Yin YQ (1988) Convergence to the semicircle law. Ann. Probab. 16(2):863–875.
[4] Bai ZD, Yin YQ (1993) Limit of the smallest eigenvalue of a large dimensional sample covariance matrix. Ann. Probab. 21(3):1275–1294.
[5] Barron AR (1994) Approximation and estimation bounds for artificial neural networks. Machine Learn. 14(1):115–133.
[6] Bartlett PL, Foster DJ, Telgarsky MJ (2017) Spectrally-normalized margin bounds for neural networks. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 6240–6249.
[7] Bartlett PL, Harvey N, Liaw C, Mehrabian A (2019) Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. J. Machine Learn. Res. 20(63):1–17.
[8] Bhatia R (2013) Matrix Analysis, vol. 169 (Springer Science & Business Media, Berlin, Heidelberg).
[9] Blum A, Rivest RL (1989) Training a 3-node neural network is NP-complete. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 494–501.
[10] Bölcskei H, Grohs P, Kutyniok G, Petersen P (2019) Optimal approximation with sparsely connected deep neural networks. SIAM J. Math. Data Sci. 1(1):8–45.
[11] Brutzkus A, Globerson A (2017) Globally optimal gradient descent for a convnet with gaussian inputs. Proc. 34th Internat. Conf. Machine Learn., vol. 70 (JMLR.org), 605–614.
[12] Brutzkus A, Globerson A, Malach E, Shalev-Shwartz S (2017) SGD learns over-parameterized networks that provably generalize on linearly separable data. Preprint, submitted October 27, https://arxiv.org/abs/1710.10174.
[13] Candès EJ, Li X (2014) Solving quadratic equations via phaselift when there are about as many equations as unknowns. Foundations Comput. Math. 14:1017–1026.
[14] Candès EJ, Plan Y (2011) Tight oracle inequalities for low-rank matrix recovery from a minimal number of noisy random measurements. IEEE Trans. Inform. Theory 57(4):2342–2359.
[15] Candès EJ, Strohmer T, Voroninski V (2013) Phaselift: Exact and stable signal recovery from magnitude measurements via convex programming. Comm. Pure Appl. Math. 66(8):1241–1274.
[16] Caron R, Traynor T (2005) The zero set of a polynomial. WSMR Report 05-02, University of Windsor, Windsor, Canada.
[17] Chen TQ, Rubanova Y, Bettencourt J, Duvenaud DK (2018) Neural ordinary differential equations. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 6571–6583.
[18] Chizat L, Bach F (2018) On the global convergence of gradient descent for over-parameterized models using optimal transport. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 3036–3046.
[19] Choromanska A, Henaff M, Mathieu M, Ben Arous G, LeCun Y (2015) The loss surfaces of multilayer networks. Artificial Intelligence Statist. (PMLR, New York), 192–204.
[20] Collobert R, Weston J (2008) A unified architecture for natural language processing: Deep neural networks with multitask learning. Proc. 25th Internat. Conf. Machine Learn. (ACM, New York), 160–167.
[21] De Fauw J, Ledsam JR, Romera-Paredes B, Nikolov S, Tomasev N, Blackwell S, Askham H, et al. (2018) Clinically applicable deep learning for diagnosis and referral in retinal disease. Nature Medicine 24(9):1342–1350.
[22] Demanet L, Hand P (2014) Stable optimizationless recovery from phaseless linear measurements. J. Fourier Anal. Appl. 20:199–221.
[23] Deng Y, Li Z, Song Z (2023) An improved sample complexity for rank-1 matrix sensing. Preprint, submitted March 13, https://arxiv.org/abs/2303.06895.
[24] Du SS, Lee JD (2018) On the power of over-parametrization in neural networks with quadratic activation. Preprint, submitted March 3, https://arxiv.org/abs/1803.01206.
[25] Du SS, Lee JD, Tian Y (2017) When is a convolutional filter easy to learn? Preprint, submitted September 18, https://arxiv.org/abs/1709.06129.
[26] Du SS, Zhai X, Poczos B, Singh A (2018) Gradient descent provably optimizes over-parameterized neural networks. Preprint, submitted October 4, https://arxiv.org/abs/1810.02054.
[27] Du SS, Lee JD, Li H, Wang L, Zhai X (2018) Gradient descent finds global minima of deep neural networks. Preprint, submitted November 9, https://arxiv.org/abs/1811.03804.
[28] Du SS, Lee JD, Tian Y, Poczos B, Singh A (2017) Gradient descent learns one-hidden-layer CNN: Don’t be afraid of spurious local minima. Preprint, submitted December 3, https://arxiv.org/abs/1712.00779.
[29] Du SS, Jin C, Lee JD, Jordan MI, Singh A, Poczos B (2017) Gradient descent can take exponential time to escape saddle points. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 1067–1077.
[30] Dziugaite GK, Roy DM (2017) Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. Preprint, submitted March 31, https://arxiv.org/abs/1703.11008.
[31] Eldan R, Shamir O (2016) The power of depth for feedforward neural networks. Conf. Learn. Theory (PMLR, New York), 907–940.
[32] Eldar YC, Mendelson S (2014) Phase retrieval: Stability and recovery guarantees. Appl. Comput. Harmonic Anal. 36(3):473–494.Google Scholar
[33] Emschwiller M, Gamarnik D, Kızıldağ EC, Zadik I (2020) Neural networks and polynomial regression. Demystifying the overparametrization phenomena. Preprint, submitted March 23, https://arxiv.org/abs/2003.10523.Google Scholar
[34] Freeman CD, Bruna J (2016) Topology and geometry of half-rectified network optimization. Preprint, submitted November 4, https://arxiv.org/abs/1611.01540.Google Scholar
[35] Fulton W (2000) Eigenvalues, invariant factors, highest weights, and Schubert calculus. Bull. Amer. Math. Soc. 37(3):209–249.Crossref, Google Scholar
[36] Ge R, Lee JD, Ma T (2017) Learning one-hidden-layer neural networks with landscape design. Preprint, submitted November 1, https://arxiv.org/abs/1711.00501.Google Scholar
[37] Ge R, Huang F, Jin C, Yuan Y (2015) Escaping from saddle points—Online stochastic gradient for tensor decomposition. Conf. Learn. Theory (PMLR, New York), 797–842.Google Scholar
[38] Goel S, Kanade V, Klivans A, Thaler J (2016) Reliably learning the ReLU in polynomial time. Preprint, submitted November 30, https://arxiv.org/abs/1611.10258.Google Scholar
[39] Golowich N, Rakhlin A, Shamir O (2017) Size-independent sample complexity of neural networks. Preprint, submitted December 18, https://arxiv.org/abs/1712.06541.Google Scholar
[40] Gonon L, Grigoryeva L, Ortega J-P (2020) Approximation bounds for random neural networks and reservoir systems. Preprint, submitted February 14, https://arxiv.org/abs/2002.05933.Google Scholar
[41] Gunasekar S, Woodworth BE, Bhojanapalli S, Neyshabur B, Srebro N (2017) Implicit regularization in matrix factorization. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 30.Google Scholar
[42] Haeffele BD, Vidal R (2015) Global optimality in tensor factorization, deep learning, and beyond. Preprint, submitted June 24, https://arxiv.org/abs/1506.07540.Google Scholar
[43] Haeffele B, Young E, Vidal R (2014) Structured low-rank matrix factorization: Optimality, algorithm, and applications to image processing. Internat. Conf. Machine Learn. (PMLR, New York), 2007–2015.Google Scholar
[44] Hardt M, Ma T (2016) Identity matters in deep learning. Preprint, submitted November 14, https://arxiv.org/abs/1611.04231.
[45] Harvey N, Liaw C, Mehrabian A (2017) Nearly-tight VC-dimension bounds for piecewise linear neural networks. Conf. Learn. Theory (PMLR, New York), 1064–1068.
[46] He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. Proc. IEEE Conf. Comput. Vision Pattern Recognition (IEEE, Piscataway, NJ), 770–778.
[47] Horn RA, Johnson CR (2012) Matrix Analysis (Cambridge University Press, Cambridge, UK).
[48] Huang GB, Zhu QY, Siew CK (2006) Extreme learning machine: Theory and applications. Neurocomputing 70(1–3):489–501.
[49] Janzamin M, Sedghi H, Anandkumar A (2015) Beating the perils of non-convexity: Guaranteed training of neural networks using tensor methods. Preprint, submitted June 28, https://arxiv.org/abs/1506.08473.
[50] Jin C, Ge R, Netrapalli P, Kakade SM, Jordan MI (2017) How to escape saddle points efficiently. Proc. 34th Internat. Conf. Machine Learn., vol. 70 (JMLR.org), 1724–1732.
[51] Kawaguchi K (2016) Deep learning without poor local minima. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 586–594.
[52] Khodak M, Tenenholtz N, Mackey L, Fusi N (2021) Initialization and regularization of factorized neural layers. Preprint, submitted May 3, https://arxiv.org/abs/2105.01029.
[53] Krizhevsky A, Sutskever I, Hinton GE (2012) ImageNet classification with deep convolutional neural networks. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 1097–1105.
[54] Lee JD, Simchowitz M, Jordan MI, Recht B (2016) Gradient descent only converges to minimizers. Conf. Learn. Theory (PMLR, New York), 1246–1257.
[55] Levy KY (2016) The power of normalization: Faster evasion of saddle points. Preprint, submitted November 15, https://arxiv.org/abs/1611.04831.
[56] Li Y, Yuan Y (2017) Convergence analysis of two-layer neural networks with ReLU activation. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 597–607.
[57] Li Y, Ma T, Zhang H (2018) Algorithmic regularization in over-parameterized matrix sensing and neural networks with quadratic activations. Conf. Learn. Theory (PMLR, New York), 2–47.
[58] Liang T, Poggio T, Rakhlin A, Stokes J (2017) Fisher-Rao metric, geometry, and complexity of neural networks. Preprint, submitted November 5, https://arxiv.org/abs/1711.01530.
[59] Livni R, Shalev-Shwartz S, Shamir O (2014) On the computational efficiency of training neural networks. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 855–863.
[60] Mannelli SS, Vanden-Eijnden E, Zdeborová L (2020) Optimization and generalization of shallow neural networks with quadratic activation functions. Preprint, submitted June 27, https://arxiv.org/abs/2006.15459.
[61] Mendelson S (2017) Extending the scope of the small-ball method. Preprint, submitted September 4, https://arxiv.org/abs/1709.00843.
[62] Mhaskar H, Liao Q, Poggio T (2016) Learning functions: When is deep better than shallow. Preprint, submitted March 3, https://arxiv.org/abs/1603.00988.
[63] Mohamed AR, Dahl GE, Hinton G (2011) Acoustic modeling using deep belief networks. IEEE Trans. Audio Speech Language Processing 20(1):14–22.
[64] Neyshabur B, Bhojanapalli S, Srebro N (2017) A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. Preprint, submitted July 29, https://arxiv.org/abs/1707.09564.
[65] Neyshabur B, Tomioka R, Srebro N (2015) Norm-based capacity control in neural networks. Conf. Learn. Theory (PMLR, New York), 1376–1401.
[66] Neyshabur B, Bhojanapalli S, McAllester D, Srebro N (2017) Exploring generalization in deep learning. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 5947–5956.
[67] Nguyen Q, Hein M (2017) The loss surface of deep and wide neural networks. Proc. 34th Internat. Conf. Machine Learn., vol. 70 (JMLR.org), 2603–2612.
[68] Nguyen Q, Hein M (2018) The loss surface and expressivity of deep convolutional neural networks (OpenReview.net).
[69] Pennington J, Worah P (2017) Nonlinear random matrix theory for deep learning. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 2637–2646.
[70] Poggio T, Mhaskar H, Rosasco L, Miranda B, Liao Q (2017) Why and when can deep-but not shallow-networks avoid the curse of dimensionality: A review. Internat. J. Automation Comput. 14(5):503–519.
[71] Poston T, Lee CN, Choie Y, Kwon Y (1991) Local minima and back propagation. IJCNN-91-Seattle Internat. Joint Conf. Neural Networks, vol. 2 (IEEE, Piscataway, NJ), 173–176.
[72] Qin L, Song Z, Zhang R (2023) A general algorithm for solving rank-one matrix sensing. Preprint, submitted March 22, https://arxiv.org/abs/2303.12298.
[73] Rahimi A, Recht B (2009) Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 1313–1320.
[74] Recht B, Fazel M, Parrilo PA (2010) Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Rev. 52(3):471–501.
[75] Rotskoff GM, Vanden-Eijnden E (2018) Neural networks as interacting particle systems: Asymptotic convexity of the loss landscape and universal scaling of the approximation error. Preprint, submitted May 2, https://arxiv.org/abs/1805.00915.
[76] Rudin W (1964) Principles of Mathematical Analysis, vol. 3 (McGraw-Hill, New York).
[77] Safran I, Shamir O (2017) Spurious local minima are common in two-layer ReLU neural networks. Preprint, submitted December 24, https://arxiv.org/abs/1712.08968.
[78] Schmidt-Hieber J (2017) Nonparametric regression using deep neural networks with ReLU activation function. Preprint, submitted August 22, https://arxiv.org/abs/1708.06633.
[79] Sedghi H, Anandkumar A (2014) Provable methods for training neural networks with sparse connectivity. Preprint, submitted December 8, https://arxiv.org/abs/1412.2693.
[80] Silver D, Schrittwieser J, Simonyan K, Antonoglou I, Huang A, Guez A, Hubert T, et al. (2017) Mastering the game of Go without human knowledge. Nature 550(7676):354–359.
[81] Sirignano J, Spiliopoulos K (2020) Mean field analysis of neural networks: A central limit theorem. Stochastic Processes Appl. 130(3):1820–1852.
[82] Soltanolkotabi M (2017) Learning ReLUs via gradient descent. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 2007–2017.
[83] Soltanolkotabi M, Javanmard A, Lee JD (2018) Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. IEEE Trans. Inform. Theory 65(2):742–769.
[84] Song M, Montanari A, Nguyen P (2018) A mean field view of the landscape of two-layers neural networks. Proc. Natl. Acad. Sci. USA 115(33):E7665–E7671.
[85] Soudry D, Carmon Y (2016) No bad local minima: Data independent training error guarantees for multilayer neural networks. Preprint, submitted May 26, https://arxiv.org/abs/1605.08361.
[86] Soudry D, Hoffer E (2017) Exponentially vanishing sub-optimal local minima in multilayer neural networks. Preprint, submitted February 19, https://arxiv.org/abs/1702.05777.
[87] Stöger D, Soltanolkotabi M (2021) Small random initialization is akin to spectral learning: Optimization and generalization guarantees for overparameterized low-rank matrix reconstruction. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), vol. 34, 23831–23843.
[88] Telgarsky M (2016) Benefits of depth in neural networks. Preprint, submitted February 14, https://arxiv.org/abs/1602.04485.
[89] Tian Y (2017) An analytical formula of population gradient for two-layered ReLU network and its applications in convergence and critical point analysis. Proc. 34th Internat. Conf. Machine Learn., vol. 70 (JMLR.org), 3404–3413.
[90] Venturi L, Bandeira AS, Bruna J (2019) Spurious valleys in one-hidden-layer neural network optimization landscapes. J. Machine Learn. Res. 20(133):1–34.
[91] Vershynin R (2010) Introduction to the non-asymptotic analysis of random matrices. Preprint, submitted November 12, https://arxiv.org/abs/1011.3027.
[92] Vodrahalli K, Shivanna R, Sathiamoorthy M, Jain S, Chi EH (2022) Nonlinear initialization methods for low-rank neural networks. Preprint, submitted February 2, https://arxiv.org/abs/2202.00834.
[93] Wei C, Lee JD, Liu Q, Ma T (2018) On the margin theory of feedforward neural networks. Preprint, submitted October 12, https://arxiv.org/abs/1810.05369.
[94] Weinan E, Han J, Jentzen A (2017) Deep learning-based numerical methods for high-dimensional parabolic partial differential equations and backward stochastic differential equations. Comm. Math. Statist. 5(4):349–380.
[95] Wu L, Zhu Z, W E (2017) Toward understanding generalization of deep learning: Perspective of loss landscapes. Preprint, submitted June 30, https://arxiv.org/abs/1706.10239.
[96] Zhang C, Bengio S, Hardt M, Recht B, Vinyals O (2016) Understanding deep learning requires rethinking generalization. Preprint, submitted November 10, https://arxiv.org/abs/1611.03530.
[97] Zhong K, Jain P, Dhillon IS (2015) Efficient matrix sensing using rank-1 Gaussian measurements. Algorithmic Learn. Theory: 26th Internat. Conf. Proc. (Springer, Berlin, Heidelberg), 3–18.
[98] Zhong K, Song Z, Dhillon IS (2017) Learning non-overlapping convolutional neural networks with multiple kernels. Preprint, submitted November 8, https://arxiv.org/abs/1711.03440.
[99] Zhong K, Song Z, Jain P, Bartlett PL, Dhillon IS (2017) Recovery guarantees for one-hidden-layer neural networks. Proc. 34th Internat. Conf. Machine Learn., vol. 70 (JMLR.org), 4140–4149.
[100] Zhou P, Feng J (2017) The landscape of deep learning algorithms. Preprint, submitted May 19, https://arxiv.org/abs/1705.07038.
[101] Zhou Y, Liang Y (2017) Critical points of neural networks: Analytical forms and landscape properties. Preprint, submitted October 30, https://arxiv.org/abs/1710.11205.

Mathematics of Operations Research

Volume 50, Issue 1

February 2025

Pages 1-781 C2


Received: March 30, 2021
Accepted: December 10, 2023
Published Online: February 20, 2024

Cite as

David Gamarnik, Eren C. Kızıldağ, Ilias Zadik (2024) Stationary Points of a Shallow Neural Network with Quadratic Activations and the Global Optimality of the Gradient Descent Algorithm. Mathematics of Operations Research 50(1):209–251.

https://doi.org/10.1287/moor.2021.0082


